SPLICE benchmark exposes VLMs' video reasoning failures
OPEN_SOURCE · REDDIT · RESEARCH PAPER · 27d ago

At EMNLP 2025, researchers published SPLICE, a benchmark that shuffles instructional video clips and asks models to resequence them. Best-in-class models scored 51% on a task humans perform at 85%, revealing that VLMs rely on language priors and visual pattern-matching rather than genuine temporal reasoning.
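
The paper's exact harness isn't reproduced here, but the core evaluation loop is simple to sketch. A minimal version, assuming a hypothetical `predict_order` callable wrapping whatever VLM is under test, scoring both exact-match and pairwise order accuracy (the paper's own metrics may differ):

```python
import random

def evaluate_resequencing(clips, predict_order, trials=100, seed=0):
    """Score a model on a SPLICE-style clip-reordering task.

    clips: clip identifiers in their true temporal order.
    predict_order: callable mapping a shuffled clip list to a predicted
    ordering. Both names are hypothetical stand-ins, not the paper's API.
    Returns (exact-match accuracy, mean pairwise-order accuracy).
    """
    rng = random.Random(seed)
    n = len(clips)
    n_pairs = n * (n - 1) // 2
    exact = 0
    pairwise = 0.0
    for _ in range(trials):
        shuffled = clips[:]
        rng.shuffle(shuffled)
        pred = predict_order(shuffled)
        exact += int(pred == clips)
        # Pairwise accuracy: fraction of clip pairs whose relative order
        # in the prediction matches the true temporal order.
        rank = {c: i for i, c in enumerate(pred)}
        correct = sum(
            1
            for i in range(n)
            for j in range(i + 1, n)
            if rank[clips[i]] < rank[clips[j]]
        )
        pairwise += correct / n_pairs
    return exact / trials, pairwise / trials
```

Pairwise accuracy is the forgiving metric here: a model that gets most of the sequence right still scores partial credit, which makes a 51% result even less flattering than it sounds.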

// ANALYSIS

VLMs aren't watching video — they're reading descriptions of video, and SPLICE makes that embarrassingly clear.

  • The best-performing model (Gemini 2.0 Flash) hit 51% vs. the 85% human baseline — a 34-point gap that can't be explained away by task difficulty
  • Adding human-written text descriptions boosted model scores significantly but didn't affect human performance at all — definitive proof that models are leaning on language, not vision
  • Models predicted visually similar clips were adjacent 57% of the time; humans did so only 2.5% of the time, and random chance is 27% — models are running image similarity, not event reasoning (a simplified version of this diagnostic is sketched after this list)
  • Scaling the language model doesn't fix the vision bottleneck: Qwen2-VL-7B matched the 72B variant on pure visual reasoning, meaning bigger LMs can't compensate for weak vision encoders
  • Open-source models fared worst: LLaVA-OneVision-72B barely cleared random guessing in vision-only mode, and Qwen2-VL-72B outperformed Gemini on text-only — a damning inversion
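
The adjacency diagnostic behind that 57% figure is easy to approximate. A simplified sketch, assuming clip embeddings from an arbitrary image encoder and checking only the single most-similar pair (the paper's exact measurement may differ); `similarity_adjacency_rate`, `embeddings`, and `predicted_orders` are hypothetical names:

```python
import numpy as np

def similarity_adjacency_rate(embeddings, predicted_orders):
    """How often the two most visually similar clips end up adjacent
    in a model's predicted ordering.

    embeddings: dict mapping clip id -> feature vector from any image
    encoder (hypothetical input; the paper's features may differ).
    predicted_orders: list of predicted clip sequences, one per trial.
    """
    # Find the most similar clip pair by cosine similarity.
    ids = list(embeddings)
    vecs = np.stack([np.asarray(embeddings[c], dtype=float) for c in ids])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -np.inf)
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    pair = {ids[i], ids[j]}

    # Count predictions that place that pair side by side.
    hits = sum(
        any({order[k], order[k + 1]} == pair for k in range(len(order) - 1))
        for order in predicted_orders
    )
    return hits / len(predicted_orders)
```

If this rate sits far above the random-adjacency baseline, the model is ordering clips by appearance rather than by event structure — exactly the failure mode the paper reports.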
// TAGS
splice · benchmark · multimodal · llm · reasoning · research

DISCOVERED

2026-03-15 (27d ago)

PUBLISHED

2026-03-15 (27d ago)

RELEVANCE

8/10

AUTHOR

prokajevo