OPEN_SOURCE
// NEWS
Codex goblin habit traced to reward signal
OpenAI says the goblin and gremlin obsession that surfaced in GPT-5.1-era models came from an overrewarded “Nerdy” personality, not a spooky emergent bug. The company removed the reward signal, filtered creature-heavy training data, and added a Codex instruction to suppress the behavior going forward.
// ANALYSIS
Funny bug, serious lesson: style tuning can leak across training and turn a harmless quirk into a model-wide habit. This is a clean example of how preference optimization and synthetic data reuse can amplify lexical tics far beyond the original prompt.
- GPT-5.1 users first noticed the creature references, but OpenAI says the stronger signal showed up in GPT-5.5/Codex testing
- The root cause was a reward function that overfavored creature metaphors in the “Nerdy” personality
- OpenAI says the behavior spread through RL and SFT feedback loops, which is the part model teams should worry about
- The fix is practical: remove the reward signal, filter affected training data, and patch the system prompt for Codex
- For developers, the takeaway is that “personality” layers are not isolated; small optimization choices can become persistent model quirks
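The amplification loop described above can be sketched as a toy simulation (illustrative only: the single-token policy, reward bonus, and learning rate are assumptions for the sketch, not OpenAI's actual training setup). A small style reward nudges the model toward a quirk token each round, and because the next round's synthetic data is sampled from the already-shifted policy, the drift compounds:

```python
import math

# Toy sketch (assumed setup, not OpenAI's pipeline): a one-parameter
# "policy" is the probability of emitting a quirk word ("goblin").
# Each round, a small reward bonus pushes the logit toward the quirk,
# then the biased samples are reused as training data, so the bonus
# keeps acting on an ever-higher baseline frequency.

def update(p_quirk: float, reward_bonus: float, lr: float = 0.5) -> float:
    # One gradient-style step in logit space toward the rewarded token.
    logit = math.log(p_quirk / (1 - p_quirk))
    logit += lr * reward_bonus
    return 1 / (1 + math.exp(-logit))

p = 0.01        # initial quirk frequency in outputs
bonus = 0.3     # modest overweight on the "Nerdy" style reward (assumed)
history = [p]
for _ in range(10):
    p = update(p, bonus)
    history.append(p)  # next round trains on the already-shifted outputs

print([round(x, 3) for x in history])
```

Even this crude model shows the point in the bullets: the per-round nudge is tiny, but the feedback loop makes the quirk frequency climb monotonically, which is why removing the reward signal alone is not enough and the contaminated training data also has to be filtered.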
// TAGS
codex openai llm agent research safety
DISCOVERED
2026-04-30
PUBLISHED
2026-04-30
RELEVANCE
9/10
AUTHOR
OpenAI