Local LLMs fail 18th-century calendar test
A historical genealogy puzzle shared on r/LocalLLaMA reveals a persistent reasoning gap in consumer-grade open models, which fail to account for the March 25 New Year used alongside the Julian calendar before 1752 when evaluating 1740s birth records. Under that convention January and February belong to the end of the year, so a record dated January 1747 falls 14 months after one dated November 1746, not two. While frontier models correctly identify the 14-month gap, local models often hallucinate medical miracles or dismiss the records as typos.
This "tricky question" serves as a sharp reminder that benchmark scores often mask fundamental reasoning failures in smaller models. Consumer-grade models running on hardware like 64 GB Macs frequently fail to connect the 1752 calendar shift to specific date calculations, with failure modes ranging from "medical miracles" to circular reasoning. The gap highlights the difference between a model knowing a fact and reasoning with it: even high-performing open-weight models like Qwen struggle when historical context has to override modern defaults.
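The date calculation the puzzle turns on can be sketched in a few lines of Python. This is an illustrative sketch, not the original poster's method: the function names are hypothetical, the day of the month (15) is a placeholder because the summary gives only months and years, and the Julian-to-Gregorian day offset is ignored since it does not change a month-level gap.

```python
def new_style_year(year: int, month: int, day: int) -> int:
    """Renumber a year written with a March 25 year start to a January 1 start.

    Assumption: a record dated January 1 through March 24 carries the number
    of the year that began the previous March 25, so it is shifted forward by one.
    """
    if (month, day) < (3, 25):
        return year + 1
    return year


def months_between(first: tuple, second: tuple) -> int:
    """Month-level gap between two (year, month, day) records, year start March 25."""
    y1, m1, d1 = first
    y2, m2, d2 = second
    return (new_style_year(y2, m2, d2) * 12 + m2) - (new_style_year(y1, m1, d1) * 12 + m1)


def naive_months_between(first: tuple, second: tuple) -> int:
    """The modern-default reading: every year starts on January 1."""
    y1, m1, _ = first
    y2, m2, _ = second
    return (y2 * 12 + m2) - (y1 * 12 + m1)


# Placeholder days; only the months and years come from the puzzle.
november_1746 = (1746, 11, 15)
january_1747 = (1747, 1, 15)

print(months_between(november_1746, january_1747))        # 14
print(naive_months_between(november_1746, january_1747))  # 2
```

The two-month answer from the naive version is the reading that produces the "medical miracle" or "typo" responses; the only difference is whether January is treated as the start or the end of the numbered year.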
DISCOVERED
2026-04-03
PUBLISHED
2026-04-03
AUTHOR
Murgatroyd314