OPEN_SOURCE
// NEWS
Codex goblin habit traced to reward signal
OpenAI says the goblin and gremlin obsession that surfaced in GPT-5.1-era models came from an overrewarded “Nerdy” personality, not a spooky emergent bug. The company removed the reward signal, filtered creature-heavy training data, and added a Codex instruction to suppress the behavior going forward.
// ANALYSIS
Funny bug, serious lesson: style tuning can leak across training and turn a harmless quirk into a model-wide habit. This is a clean example of how preference optimization and synthetic data reuse can amplify lexical tics far beyond the original prompt.
- GPT-5.1 users first noticed the creature references, but OpenAI says the stronger signal showed up in GPT-5.5/Codex testing
- The root cause was a reward function that overfavored creature metaphors in the “Nerdy” personality
- OpenAI says the behavior spread through RL and SFT feedback loops, which is the part model teams should worry about
- The fix is practical: remove the reward signal, filter affected training data, and patch the system prompt for Codex
- For developers, the takeaway is that “personality” layers are not isolated; small optimization choices can become persistent model quirks
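The amplification loop described above can be sketched as a toy simulation (illustrative only: the single-token policy, reward bonus, and learning rate are assumptions for the sketch, not OpenAI's actual training setup). A small style reward nudges the model toward a quirk token each round, and because the next round's synthetic data is sampled from the already-shifted policy, the drift compounds:

```python
import math

# Toy sketch (assumed setup, not OpenAI's pipeline): a one-parameter
# "policy" is the probability of emitting a quirk word ("goblin").
# Each round, a small reward bonus pushes the logit toward the quirk,
# then the biased samples are reused as training data, so the bonus
# keeps acting on an ever-higher baseline frequency.

def update(p_quirk: float, reward_bonus: float, lr: float = 0.5) -> float:
    # One gradient-style step in logit space toward the rewarded token.
    logit = math.log(p_quirk / (1 - p_quirk))
    logit += lr * reward_bonus
    return 1 / (1 + math.exp(-logit))

p = 0.01        # initial quirk frequency in outputs
bonus = 0.3     # modest overweight on the "Nerdy" style reward (assumed)
history = [p]
for _ in range(10):
    p = update(p, bonus)
    history.append(p)  # next round trains on the already-shifted outputs

print([round(x, 3) for x in history])
```

Even this crude model shows the point in the bullets: the per-round nudge is tiny, but the feedback loop makes the quirk frequency climb monotonically, which is why removing the reward signal alone is not enough and the contaminated training data also has to be filtered.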
// TAGS
codex openai llm agent research safety
DISCOVERED
2026-04-30
PUBLISHED
2026-04-30
RELEVANCE
9/10
AUTHOR
OpenAI