OmniGAIA benchmarks omni-modal agent reasoning
OmniGAIA is a new research benchmark for agents that must reason across video, audio, and images while using tools like web search and code execution. The project also ships OmniAtlas, an active-perception agent framework, along with open-source code, datasets, a public leaderboard, and model checkpoints on GitHub and Hugging Face.
This is the kind of paper that matters because it attacks a real weakness in multimodal AI: most systems still reason in pairs of modalities, not across the full messy stack of media developers actually deal with. OmniGAIA stands out by pairing a harder benchmark with a concrete agent framework, which makes it more useful than yet another leaderboard-only release.
- The benchmark is built around an omni-modal event graph, so tasks are explicitly designed to require multi-hop reasoning across image, audio, and video instead of shallow captioning-style pattern matching.
- OmniAtlas adds active perception: the agent can request additional media segments during reasoning rather than passively consuming a fixed prompt (see the sketch after this list).
- The benchmark stats are a strong signal of difficulty: 98.6% of tasks require web search and 74.4% require code or computation, pushing closer to real agent workflows.
- The team released code, benchmark assets, a public leaderboard, and several OmniAtlas checkpoints, which gives the paper a better chance of becoming an actual reference point for multimodal agent evaluation.
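To make the "active perception" idea concrete, here is a minimal, hypothetical sketch of such a loop in Python. It is not the OmniAtlas API; every class and policy below is an illustrative assumption. The point is the control flow: the agent pulls media segments on demand and stops once it has enough evidence, instead of reading a fixed bundle of inputs up front.

```python
# Hypothetical active-perception loop; names and policies are illustrative,
# not the OmniAtlas implementation.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MediaSegment:
    modality: str   # "video" | "audio" | "image"
    source_id: str  # which asset the segment comes from
    span: tuple     # e.g. (start_s, end_s) for time-based media
    content: str    # stand-in for decoded features / a transcript


@dataclass
class Task:
    question: str
    # Pool of segments the environment could serve if the agent asks.
    available: list = field(default_factory=list)


class ActivePerceptionAgent:
    """Toy agent: requests segments one at a time instead of reading everything."""

    def __init__(self, max_requests: int = 5):
        self.max_requests = max_requests
        self.observed: list[MediaSegment] = []

    def request_segment(self, task: Task) -> MediaSegment | None:
        # Placeholder acquisition policy: return the next unseen segment.
        # A real policy would score candidates by expected information gain.
        seen = {(s.source_id, s.span) for s in self.observed}
        for seg in task.available:
            if (seg.source_id, seg.span) not in seen:
                return seg
        return None

    def answer(self, task: Task) -> str:
        for _ in range(self.max_requests):
            seg = self.request_segment(task)
            if seg is None:
                break
            self.observed.append(seg)
            # Stop early once the evidence looks sufficient (toy criterion).
            if "answer" in seg.content:
                break
        evidence = "; ".join(s.content for s in self.observed)
        return f"Q: {task.question} | evidence used: {evidence}"


if __name__ == "__main__":
    task = Task(
        question="What does the speaker point at when the alarm sounds?",
        available=[
            MediaSegment("audio", "clip_01", (0.0, 5.0), "alarm starts at 3.2s"),
            MediaSegment("video", "clip_01", (3.0, 6.0),
                         "speaker points at the exit sign (answer)"),
        ],
    )
    print(ActivePerceptionAgent().answer(task))
```

The design choice this illustrates is that perception becomes part of the reasoning loop: which clip to fetch next is itself a decision, which is what separates this setup from benchmarks that hand the model a fixed multimodal prompt.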
DISCOVERED: 2026-03-06
PUBLISHED: 2026-03-06
AUTHOR: Discover AI