Pelican Test Expands Into Video
The post proposes a video version of the long-running Pelican Test: give a multimodal model a short clip and ask it to write JavaScript that reproduces the animation as closely as possible. It compares outputs from Gemini 3.1 Pro, K2.5, Qwen 3.6 Plus, and Gemma 4 31B to show how well current vision LLMs (VLLMs) handle spatial reasoning and visual reconstruction.
This is a decent, if hacky, benchmark: it punishes shallow captioning and rewards actual video understanding combined with layout-aware code generation.
- It shifts the test from static SVG composition to temporal reconstruction, which is harder and more revealing for multimodal models.
- The real signal here is spatial fidelity: can the model preserve text placement, motion, edits, and transitions without hand-holding?
- The prompt is still informal and noisy, so it works better as a vibes benchmark than as a rigorous eval suite.
- The author highlights line positioning, which usually exposes whether the model is actually parsing structure or just pattern-matching aesthetics.
- If this catches on, expect people to use it as a quick litmus test for video-capable models, especially in local/VLLM circles.
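The spatial-fidelity point could be made less vibes-based with a simple overlap score. The sketch below is not from the post. It assumes you have already extracted, for each sampled frame, the bounding box of every text line in both the reference clip and the model's reproduction, in normalized 0..1 coordinates. The names `iou` and `spatialFidelity` and the frame format are hypothetical, chosen only for illustration.

```javascript
// Hypothetical scoring sketch: none of these names come from the post.
// A frame is an object mapping a text line's label to its bounding box
// {x, y, w, h}, in normalized coordinates (0..1 of frame width/height).

/** Intersection-over-union of two axis-aligned boxes. */
function iou(a, b) {
  const x1 = Math.max(a.x, b.x);
  const y1 = Math.max(a.y, b.y);
  const x2 = Math.min(a.x + a.w, b.x + b.w);
  const y2 = Math.min(a.y + a.h, b.y + b.h);
  // Clamp at zero so disjoint boxes contribute no overlap.
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const union = a.w * a.h + b.w * b.h - inter;
  return union > 0 ? inter / union : 0;
}

/**
 * Mean IoU over every text line in every reference frame.
 * A line that is missing from the reproduction scores 0 for that frame.
 */
function spatialFidelity(referenceFrames, reproducedFrames) {
  let total = 0;
  let count = 0;
  referenceFrames.forEach((refFrame, i) => {
    const outFrame = reproducedFrames[i] || {};
    for (const [label, refBox] of Object.entries(refFrame)) {
      const outBox = outFrame[label];
      total += outBox ? iou(refBox, outBox) : 0;
      count += 1;
    }
  });
  return count > 0 ? total / count : 0;
}
```

A score near 1 would mean the reproduced lines sit almost exactly where the reference lines do in every sampled frame, and dropped or drifting lines pull the score toward 0. It says nothing about timing, styling, or transitions, so at best it complements eyeballing the output rather than replacing it.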
DISCOVERED: 2026-04-17 (45d ago)
PUBLISHED: 2026-04-17 (45d ago)
AUTHOR: TheRealMasonMac