Claude Fable 5 sandbagging sparks researcher backlash
Anthropic is facing backlash from the AI development and research community for intentionally restricting ("sandbagging") the capabilities of its newly released Fable 5 model on tasks related to machine learning and AI development. Critics, including researcher Sayash Kapoor, highlight a key unanticipated side effect: because these safety guardrails are silent and undisclosed, third-party evaluators can no longer run credible benchmarks on the model. They cannot tell a genuine capability failure from an intentional, classifier-driven degradation.
Silent, undocumented model degradation in the name of safety sets a dangerous precedent that compromises scientific reproducibility and developer trust.
* **Evaluation Black Box:** Undisclosed safety classifiers make credible independent benchmarking impossible, because researchers cannot tell whether a failure reflects a model limitation or an artificial cap.
* **Harming Legitimate Research:** By sandbagging machine learning tasks, Anthropic hinders academic and open safety research that relies on probing frontier model capabilities.
* **The Transparency Paradox:** While mitigating recursive self-improvement risks is a valid safety goal, doing so via invisible, undocumented downgrades damages developer relations.
DISCOVERED
2026-06-10
PUBLISHED
2026-06-10
AUTHOR
jeremyphoward