OPEN_SOURCE
REDDIT · RESEARCH PAPER
CRYSTAL benchmark exposes hidden reasoning gaps
CRYSTAL is a 6,372-example benchmark for multimodal reasoning that scores step-level trace quality, not just final answers. The paper evaluates 20 models and argues that accuracy alone badly overstates how well they actually reason.
// ANALYSIS
This is a strong correction to the “right answer means right reasoning” myth. The ugly takeaway is that many models can sound plausible while missing most of the steps that actually justify the answer.
- The benchmark is more useful than plain accuracy because it measures both step-level Match F1 and ordered reasoning quality (a sketch of the step-level scoring follows this list).
- The headline result is harsh: 19 of 20 models cherry-pick a few correct steps and skip the rest, so high precision hides weak recall.
- The size story is non-linear, with smaller models like Gemma3-4B outperforming much larger systems on the reasoning metric.
- CPR-Curriculum is the most interesting technical contribution here because it rewards process, not just outcome, and reportedly stabilizes training where simpler rewards collapse (a rough sketch of that idea also follows below).
- The paper’s own caveats are real: reference traces are consensus paths, borderline semantic matches are fuzzy, and causal dependency inside reasoning chains is still unsolved.
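To make the step-level scoring concrete, here is a minimal sketch of how a step-level Match F1 can be computed, assuming greedy one-to-one matching and a pluggable `is_match` predicate. The paper's actual matcher and trace format are not specified here, so both are placeholders.

```python
from typing import Callable, List


def step_match_f1(
    predicted: List[str],
    reference: List[str],
    is_match: Callable[[str, str], bool],
) -> dict:
    """Greedily match each predicted step to at most one reference step,
    then score precision (over predicted) and recall (over reference)."""
    unused = list(range(len(reference)))
    matched = 0
    for step in predicted:
        for i in unused:
            if is_match(step, reference[i]):
                unused.remove(i)  # each reference step can match only once
                matched += 1
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy illustration of the cherry-picking failure mode: two correct steps
# out of five reference steps gives perfect precision but weak recall.
exact = lambda a, b: a.strip().lower() == b.strip().lower()
reference = ["read the chart", "extract both values", "subtract",
             "convert units", "state the answer"]
predicted = ["read the chart", "state the answer"]
print(step_match_f1(predicted, reference, exact))
# {'precision': 1.0, 'recall': 0.4, 'f1': 0.571...}
```

This is exactly the gap the benchmark surfaces: an answer-only score would call the second trace perfect, while recall exposes the three missing steps.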
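For the CPR-Curriculum point, a deliberately rough sketch of a process-weighted reward with a curriculum schedule. The blend weights, the linear schedule, and the function names are assumptions for illustration; the paper's actual formulation may differ. The contrast with an outcome-only reward, a sparse 0/1 signal, is the point.

```python
def outcome_reward(answer_correct: bool) -> float:
    """Baseline: sparse 0/1 reward on the final answer only."""
    return 1.0 if answer_correct else 0.0


def process_curriculum_reward(step_f1: float, answer_correct: bool,
                              progress: float) -> float:
    """Hypothetical blend of process and outcome signals (not the paper's
    exact formula). Early in training the step-level term dominates, so
    the model earns reward for producing the justifying steps even before
    its final answers are reliable; the outcome term is phased in linearly.
    """
    w = min(max(progress, 0.0), 1.0)  # training progress in [0, 1]
    return (1.0 - 0.5 * w) * step_f1 + 0.5 * w * outcome_reward(answer_correct)


# Early training (progress=0.1): a trace with good steps but a wrong final
# answer still earns ~0.57 instead of the flat 0.0 an outcome-only reward
# would give; that denser signal is the kind of thing that can keep
# training from collapsing.
print(process_curriculum_reward(step_f1=0.6, answer_correct=False, progress=0.1))
```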
// TAGS
crystal · multimodal · reasoning · benchmark · research · open-source
DISCOVERED
2026-03-18
PUBLISHED
2026-03-18
RELEVANCE
9/10
AUTHOR
waybarrios