SPLICE benchmark exposes VLMs' video reasoning failures

// 73d ago | RESEARCH PAPER

Researchers published SPLICE, a benchmark that shuffles instructional video clips and asks models to resequence them, at EMNLP 2025. The best model scored 51% on a task where humans reach 85%, which suggests that VLMs rely on language priors and visual pattern-matching rather than genuine temporal reasoning.
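The task setup described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function names `shuffle_clips` and `exact_match_accuracy` are hypothetical, and scoring by exact match of the full ordering is an assumption, since the summary does not say how SPLICE computes accuracy.

```python
import random


def shuffle_clips(clips, seed=0):
    """Shuffle clips given in their true order.

    Returns (shuffled, answer), where answer lists the positions in
    `shuffled` that restore the original order when read left to right.
    """
    rng = random.Random(seed)
    perm = list(range(len(clips)))
    rng.shuffle(perm)
    # shuffled[j] is the clip that originally sat at position perm[j].
    shuffled = [clips[i] for i in perm]
    # Sorting shuffled positions by original position recovers the order.
    answer = sorted(range(len(perm)), key=lambda j: perm[j])
    return shuffled, answer


def exact_match_accuracy(predictions, answers):
    """Fraction of videos whose predicted ordering matches exactly.

    Assumption: a prediction only counts if every position is correct.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(1 for p, a in zip(predictions, answers) if list(p) == list(a))
    return correct / len(answers)


if __name__ == "__main__":
    shuffled, answer = shuffle_clips(["crack egg", "whisk", "pour", "cook"], seed=3)
    print(shuffled)
    print([shuffled[j] for j in answer])  # back in the original order
```

Under this scoring, a model that gets most of a sequence right but swaps two clips earns no credit for that video, which is one reason the choice of metric matters when reading the headline numbers.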

// ANALYSIS

VLMs aren't watching video; they're reading descriptions of video, and SPLICE makes that embarrassingly clear.

  • The best-performing model (Gemini 2.0 Flash) hit 51% vs. an 85% human baseline, a 34-point gap that task difficulty alone is unlikely to explain, since humans handle the same task well
  • Adding human-written text descriptions boosted model scores significantly but didn't affect human performance at all, strong evidence that models are leaning on language, not vision
  • Models placed visually similar clips next to each other 57% of the time; humans did so only 2.5% of the time, and random chance is 27%. The pattern points to image-similarity matching, not event reasoning
  • Scaling the language model doesn't fix the vision bottleneck: Qwen2-VL-7B matched the 72B variant on pure visual reasoning, which suggests that a bigger LM can't compensate for a weak vision encoder
  • Open-source models fared worst: LLaVA-OneVision-72B barely cleared random guessing in vision-only mode, even though Qwen2-VL-72B outperformed Gemini on text-only input, a striking inversion
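The adjacency statistic in the list above (57% for models, 2.5% for humans, 27% by chance) can be made concrete with a short sketch. The helper names here are hypothetical and the paper's exact procedure is not given in this summary; the code assumes each video comes with one pair of clip ids judged visually similar. The chance helper uses the standard result that a fixed pair is adjacent in a uniformly random ordering of n items with probability 2/n.

```python
def are_adjacent(order, a, b):
    """True if clips a and b sit next to each other in the predicted order."""
    return abs(order.index(a) - order.index(b)) == 1


def adjacency_rate(orders, similar_pairs):
    """Share of predicted orderings that place the similar pair side by side.

    orders: one predicted ordering (list of clip ids) per video.
    similar_pairs: one (a, b) pair of visually similar clip ids per video
    (an assumption about how such pairs would be supplied).
    """
    if not orders:
        raise ValueError("orders must not be empty")
    hits = sum(are_adjacent(o, a, b) for o, (a, b) in zip(orders, similar_pairs))
    return hits / len(orders)


def chance_adjacency(n):
    """Probability that a fixed pair is adjacent in a random ordering of n clips.

    There are 2 * (n - 1)! orderings with the pair adjacent out of n! total.
    """
    if n < 2:
        raise ValueError("need at least two clips")
    return 2 / n


if __name__ == "__main__":
    orders = [[0, 1, 2, 3], [2, 0, 3, 1], [1, 0, 3, 2]]
    pairs = [(0, 1), (0, 1), (0, 1)]
    print(adjacency_rate(orders, pairs))
    print(chance_adjacency(4))
```

Comparing the observed rate against the chance rate is what makes the reported gap readable: a rate well above chance suggests similarity is driving placement, while a rate well below it suggests the similar pair is being actively separated.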
// TAGS
splice, benchmark, multimodal, llm, reasoning, research

DISCOVERED

2026-03-15

PUBLISHED

2026-03-15

RELEVANCE

8/10

AUTHOR

prokajevo