OPEN_SOURCE
REDDIT // RESEARCH PAPER
SPLICE benchmark exposes VLMs' video reasoning failures
Researchers published SPLICE at EMNLP 2025, a benchmark that shuffles instructional video clips and asks models to resequence them. Best-in-class models scored 51% on a task humans do at 85%, revealing that VLMs rely on language priors and visual pattern-matching rather than genuine temporal reasoning.
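The shuffle-and-resequence protocol can be sketched as a small evaluation harness. This is an illustrative sketch, not the authors' code: the `predict_fn` interface and exact-match scoring are assumptions, and SPLICE may score orderings differently.

```python
import random

def evaluate_resequencing(true_order, predict_fn, trials=100, seed=0):
    """Shuffle the clip order, ask a model to recover the original
    sequence, and report exact-match accuracy over many trials.

    predict_fn is a hypothetical model interface: it takes a shuffled
    list of clip IDs and returns its predicted original ordering.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        shuffled = true_order[:]
        rng.shuffle(shuffled)
        predicted = predict_fn(shuffled)
        correct += predicted == true_order
    return correct / trials
```

A model that truly understands event order recovers `true_order` regardless of the shuffle; one that pattern-matches on surface similarity will not.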
// ANALYSIS
VLMs aren't watching video — they're reading descriptions of video, and SPLICE makes that embarrassingly clear.
- Best-performing model (Gemini 2.0 Flash) hit 51% vs. an 85% human baseline, a 34-point gap too large to explain away by task difficulty
- Adding human-written text descriptions of the clips significantly boosted model scores but left human performance unchanged, strong evidence that models lean on language, not vision
- Models placed visually similar clips adjacent 57% of the time, vs. 2.5% for humans and 27% for random chance: models are running image similarity, not event reasoning
- Scaling the language model doesn't fix the vision bottleneck: Qwen2-VL-7B matched the 72B variant on pure visual reasoning, so bigger LMs can't compensate for weak vision encoders
- Open-source models fared worst: LLaVA-OneVision-72B barely cleared random guessing in vision-only mode, while Qwen2-VL-72B outperformed Gemini on text-only, a damning inversion
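The adjacency statistic in the third bullet can be made concrete: given a predicted ordering and a set of visually similar clip pairs, count how often similar pairs land next to each other, and estimate the chance level by shuffling. A minimal sketch; `adjacency_rate` and `random_baseline` are hypothetical helper names, not from the paper.

```python
import random

def adjacency_rate(order, similar_pairs):
    """Fraction of visually-similar clip pairs placed adjacent in an ordering."""
    adjacent = {frozenset(order[i:i + 2]) for i in range(len(order) - 1)}
    hits = sum(1 for pair in similar_pairs if frozenset(pair) in adjacent)
    return hits / len(similar_pairs)

def random_baseline(n_clips, similar_pairs, trials=10000, seed=0):
    """Estimate chance-level adjacency by averaging over random orderings."""
    rng = random.Random(seed)
    clips = list(range(n_clips))
    total = 0.0
    for _ in range(trials):
        rng.shuffle(clips)
        total += adjacency_rate(clips, similar_pairs)
    return total / trials
```

In a random ordering of n clips, any given pair is adjacent with probability 2/n (e.g. 0.4 for five clips); the paper's 27% chance figure reflects SPLICE's own clip counts and similarity pairs, and this sketch only shows how such a baseline can be estimated.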
// TAGS
splice · benchmark · multimodal · llm · reasoning · research
DISCOVERED
2026-03-15 (27d ago)
PUBLISHED
2026-03-15 (27d ago)
RELEVANCE
8 / 10
AUTHOR
prokajevo