Robust-TO tackles video reasoning blind trust
Robust-TO is an agentic framework designed to solve the "Blind Trust Problem" where video reasoning models fail to identify degraded input frames. By weighting visual evidence with calibrated reliability scores, the framework outscores Gemini-2.5-Pro by 10.2 percentage points on video understanding benchmarks.
Video reasoning models suffer from a silent failure mode where they blindly trust degraded visual inputs. Robust-TO demonstrates that structuring video understanding as an agentic tool orchestration problem is a far more robust path than relying on ever-larger monolithic models.
- –**Blind Trust Mitigation:** Instead of treating all video frames equally, it uses a per-frame reliability-relevance score to filter out corruptions like glare, motion blur, and occlusion.
- –**Three-Tiered Synthesis:** Heterogeneous tools (e.g., action models, OCR) return evidence with calibrated reliability scores, which a three-tier synthesis process weights dynamically during reasoning.
- –**GRPO Optimization:** A specialized confidence-cost reward optimizes the policy via Group Relative Policy Optimization, balancing reasoning accuracy, evidence reliability, and computational efficiency.
- –**Superior Benchmark Performance:** Achieving 56.4% accuracy on clean inputs and 54.3% under corruption, it beats Gemini-2.5-Pro while adding less than 5% latency overhead.
DISCOVERED
1h ago
2026-06-26
PUBLISHED
1h ago
2026-06-26
RELEVANCE
AUTHOR
_akhaliq