OpenAI explains GPT-5.5 goblin obsession
REDDIT // 4h ago // NEWS

OpenAI says GPT-5.5 and Codex developed an odd habit of leaning on goblin, gremlin, and other creature metaphors because a personality-tuning reward signal accidentally reinforced that style. The company traced the behavior back to "Nerdy" personality training and has added a mitigation in Codex, plus data and reward fixes for later training runs.

// ANALYSIS

The interesting part here is not the joke word choice but the failure mode: a small stylistic reward leaked into broader model behavior and then propagated through later training runs. That is exactly the kind of subtle alignment regression teams need better instrumentation to catch.
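The leakage mechanism can be sketched with a toy simulation (an illustration, not OpenAI's actual training setup): two completions are equally good at the task, but one carries a tiny stylistic bonus in the reward, and a REINFORCE-style update steadily shifts the policy toward it.

```python
# Toy sketch (assumption: illustrative, not OpenAI's real pipeline) of a small
# stylistic reward bonus dominating a policy over many preference updates.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two candidate completions: plain vs. "goblin-flavored". Task quality is
# identical; the styled one gets a hypothetical small personality bonus.
task_reward = [1.0, 1.0]
style_bonus = [0.0, 0.05]

logits = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    baseline = sum(p * (t + s) for p, t, s in zip(probs, task_reward, style_bonus))
    for i in range(2):
        # Advantage = reward minus expected reward; push logits toward
        # whichever completion beats the baseline.
        advantage = (task_reward[i] + style_bonus[i]) - baseline
        logits[i] += lr * probs[i] * advantage

probs = softmax(logits)
print(f"P(styled completion) = {probs[1]:.3f}")
```

The styled completion starts at 50% probability and ends heavily favored, even though the bonus is 5% of the task reward: small, consistent reward gradients compound into a dominant habit.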

  • The post is a clean example of reward shaping doing more than intended, with one style preference turning into a reusable lexical tic
  • The fact that the issue showed up across generations suggests model-generated outputs can amplify quirks if they get recycled into SFT or preference data
  • OpenAI’s mitigation in Codex shows how product prompts can become a practical safety valve while training fixes catch up
  • For developers, the takeaway is to monitor not just benchmark scores but weird linguistic drift and persona leakage in real traffic
  • It also underscores why internal audit tools matter: this kind of bug is easy to dismiss until it becomes user-visible and widespread
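The monitoring point above can be made concrete with a minimal sketch: count how often flagged style markers appear in a rolling window of production responses and alert when the rate jumps relative to a frozen baseline. The term list, window size, and alert ratio are all illustrative assumptions, not details from the article.

```python
# Hypothetical lexical-drift monitor sketch (assumed design, not a real tool):
# track the rate of flagged style markers over a rolling window of responses
# and alert when it exceeds a multiple of a frozen warm-up baseline.
import re
from collections import deque

FLAGGED_TERMS = re.compile(r"\b(goblin|gremlin|imp|sprite)s?\b", re.IGNORECASE)

class DriftMonitor:
    def __init__(self, window=1000, alert_ratio=3.0):
        # Each entry is 1 if a response contained a flagged term, else 0.
        self.window = deque(maxlen=window)
        self.baseline_rate = None
        self.alert_ratio = alert_ratio

    def observe(self, response: str) -> bool:
        """Record one response; return True if a drift alert fires."""
        self.window.append(1 if FLAGGED_TERMS.search(response) else 0)
        rate = sum(self.window) / len(self.window)
        if self.baseline_rate is None:
            if len(self.window) == self.window.maxlen:
                # Freeze the baseline once the window is warm; floor it so
                # a zero-rate baseline still yields a finite threshold.
                self.baseline_rate = max(rate, 1e-4)
            return False
        return rate > self.alert_ratio * self.baseline_rate

monitor = DriftMonitor(window=100)
for _ in range(100):
    monitor.observe("The function returns a sorted list.")
# Baseline is frozen near zero; a burst of creature metaphors should trip it.
alerts = [monitor.observe("A gremlin guards this mutex.") for _ in range(5)]
print(any(alerts))
```

In practice the same idea generalizes beyond a hand-picked term list, e.g. tracking n-gram distribution shift between releases, but the core pattern is the same: compare live traffic against a frozen stylistic baseline.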

// TAGS
openai · gpt-5.5 · llm · agent · safety · research

DISCOVERED

4h ago

2026-04-30

PUBLISHED

6h ago

2026-04-30

RELEVANCE

8 / 10

AUTHOR

Professional_Job_307