OPEN_SOURCE
REDDIT // RESEARCH PAPER
SPLICE benchmark exposes VLMs' video reasoning failures
Researchers published SPLICE at EMNLP 2025, a benchmark that shuffles instructional video clips and asks models to resequence them. Best-in-class models scored 51% on a task humans do at 85%, revealing that VLMs rely on language priors and visual pattern-matching rather than genuine temporal reasoning.
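The shuffle-and-resequence protocol can be sketched as a small evaluation harness. This is an illustrative sketch, not the authors' code: the `predict_fn` interface and exact-match scoring are assumptions, and SPLICE may score orderings differently.

```python
import random

def evaluate_resequencing(true_order, predict_fn, trials=100, seed=0):
    """Shuffle the clip order, ask a model to recover the original
    sequence, and report exact-match accuracy over many trials.

    predict_fn is a hypothetical model interface: it takes a shuffled
    list of clip IDs and returns its predicted original ordering.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        shuffled = true_order[:]
        rng.shuffle(shuffled)
        predicted = predict_fn(shuffled)
        correct += predicted == true_order
    return correct / trials
```

A model that truly understands event order recovers `true_order` regardless of the shuffle; one that pattern-matches on surface similarity will not.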
// ANALYSIS
VLMs aren't watching video — they're reading descriptions of video, and SPLICE makes that embarrassingly clear.
- Best-performing model (Gemini 2.0 Flash) hit 51% vs. an 85% human baseline, a 34-point gap too large to explain away by task difficulty
- Adding human-written text descriptions of the clips significantly boosted model scores but left human performance unchanged, strong evidence that models lean on language, not vision
- Models placed visually similar clips adjacent 57% of the time, vs. 2.5% for humans and 27% for random chance: models are running image similarity, not event reasoning
- Scaling the language model doesn't fix the vision bottleneck: Qwen2-VL-7B matched the 72B variant on pure visual reasoning, so bigger LMs can't compensate for weak vision encoders
- Open-source models fared worst: LLaVA-OneVision-72B barely cleared random guessing in vision-only mode, while Qwen2-VL-72B outperformed Gemini on text-only, a damning inversion
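The adjacency statistic in the third bullet can be made concrete: given a predicted ordering and a set of visually similar clip pairs, count how often similar pairs land next to each other, and estimate the chance level by shuffling. A minimal sketch; `adjacency_rate` and `random_baseline` are hypothetical helper names, not from the paper.

```python
import random

def adjacency_rate(order, similar_pairs):
    """Fraction of visually-similar clip pairs placed adjacent in an ordering."""
    adjacent = {frozenset(order[i:i + 2]) for i in range(len(order) - 1)}
    hits = sum(1 for pair in similar_pairs if frozenset(pair) in adjacent)
    return hits / len(similar_pairs)

def random_baseline(n_clips, similar_pairs, trials=10000, seed=0):
    """Estimate chance-level adjacency by averaging over random orderings."""
    rng = random.Random(seed)
    clips = list(range(n_clips))
    total = 0.0
    for _ in range(trials):
        rng.shuffle(clips)
        total += adjacency_rate(clips, similar_pairs)
    return total / trials
```

In a random ordering of n clips, any given pair is adjacent with probability 2/n (e.g. 0.4 for five clips); the paper's 27% chance figure reflects SPLICE's own clip counts and similarity pairs, and this sketch only shows how such a baseline can be estimated.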
// TAGS
splice · benchmark · multimodal · llm · reasoning · research
DISCOVERED
2026-03-15 (27d ago)
PUBLISHED
2026-03-15 (27d ago)
RELEVANCE
8 / 10
AUTHOR
prokajevo