Codex goblin habit traced to reward signal

OpenAI says the goblin and gremlin obsession that surfaced in GPT-5.1-era models came from an overrewarded “Nerdy” personality, not a spooky emergent bug. The company removed the reward signal, filtered creature-heavy training data, and added a Codex instruction to suppress the behavior going forward.

// ANALYSIS

Funny bug, serious lesson: style tuning can leak across training and turn a harmless quirk into a model-wide habit. This is a clean example of how preference optimization and synthetic data reuse can amplify lexical tics far beyond the original prompt.
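The amplification dynamic can be sketched with a toy simulation: a reward function that accidentally hands out a bonus for certain words, combined with best-of-n selection feeding winners back into the pool, steadily inflates the frequency of the lexical tic. This is an invented illustration, not OpenAI's actual pipeline; the words, bonus size, and selection scheme are all assumptions.

```python
import random

CREATURE_WORDS = {"goblin", "gremlin"}  # hypothetical lexical tic

def reward(text: str) -> float:
    """Toy reward: a random 'quality' score plus an accidental
    bonus whenever a creature word appears."""
    base = random.random()  # stand-in for a real quality judgment
    bonus = 0.5 if any(w in text for w in CREATURE_WORDS) else 0.0
    return base + bonus

random.seed(0)

# Start with a mostly clean pool: 5% of completions carry the tic.
pool = ["the parser walks the tree"] * 95 + \
       ["the goblin parser walks the tree"] * 5

# Each round: best-of-4 selection, and winners repopulate the pool,
# mimicking preference optimization plus synthetic-data reuse.
for _ in range(10):
    winners = []
    for _ in range(100):
        candidates = [random.choice(pool) for _ in range(4)]
        winners.append(max(candidates, key=reward))
    pool = winners

tic_rate = sum(any(w in t for w in CREATURE_WORDS) for t in pool) / len(pool)
print(f"creature-word rate after selection: {tic_rate:.0%}")
```

A 0.5 bonus on a unit-scale reward is enough to push a 5% tic toward near-total saturation in a handful of rounds, which is the "harmless quirk becomes model-wide habit" failure mode in miniature.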

  • GPT-5.1 users first noticed the creature references, but OpenAI says the stronger signal showed up in GPT-5.5/Codex testing
  • The root cause was a reward function that overfavored creature metaphors in the “Nerdy” personality
  • OpenAI says the behavior spread through RL and SFT feedback loops, which is the part model teams should worry about
  • The fix is practical: remove the reward signal, filter affected training data, and patch the system prompt for Codex
  • For developers, the takeaway is that “personality” layers are not isolated; small optimization choices can become persistent model quirks
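The data-filtering step in the fix can be sketched as a simple density check over training examples. Everything here is hypothetical: the regex, the 2% threshold, and the `filter_dataset` helper are invented for illustration, not OpenAI's tooling.

```python
import re

# Hypothetical pattern for the affected vocabulary.
CREATURE_RE = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)

def creature_density(text: str) -> float:
    """Fraction of words in `text` that are creature references."""
    words = text.split()
    if not words:
        return 0.0
    return len(CREATURE_RE.findall(text)) / len(words)

def filter_dataset(examples, max_density=0.02):
    """Drop examples whose creature-word density exceeds the threshold."""
    return [ex for ex in examples if creature_density(ex) <= max_density]

corpus = [
    "refactor the lexer to emit tokens lazily",
    "the goblin in the gremlin scheduler eats your goblin tasks",
    "add a retry with exponential backoff",
]
clean = filter_dataset(corpus)
print(len(clean))  # the creature-heavy example is dropped
```

A threshold-based filter like this removes the worst offenders without discarding data that merely mentions a creature once, which matters when the affected corpus is large.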
// TAGS
codex · openai · llm · agent · research · safety

DISCOVERED

2026-04-30

PUBLISHED

2026-04-30

RELEVANCE

9/10

AUTHOR

OpenAI