Hostile prompts cut LLM performance across scales
New research across 14 model configurations reveals a 5-13% drop in instruction-following performance when models face hostile user prompts. This "hostility residual" persists from 0.6B to 123B parameters, suggesting that scaling alone cannot solve model sensitivity to aggressive prompt framing.
Scaling isn't a silver bullet for model robustness: hostile user framing measurably degrades instruction-following at every model size tested.
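The summary does not say how the 5-13% drop is computed. A minimal sketch of one plausible reading follows, assuming each benchmark item is scored twice (once with neutral framing, once with hostile framing) and the residual is the absolute difference in accuracy. The names `PairedResult` and `hostility_residual` are hypothetical and do not come from the research.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PairedResult:
    """One benchmark item scored under both framings (hypothetical schema)."""
    neutral_correct: bool
    hostile_correct: bool


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; raises on an empty input."""
    values = list(flags)
    if not values:
        raise ValueError("no results to score")
    return sum(values) / len(values)


def hostility_residual(results: List[PairedResult]) -> float:
    """Neutral accuracy minus hostile accuracy, as a fraction.

    Assumption: the drop is an absolute accuracy difference.
    The source may instead report a relative drop.
    """
    neutral = accuracy(r.neutral_correct for r in results)
    hostile = accuracy(r.hostile_correct for r in results)
    return neutral - hostile


if __name__ == "__main__":
    # Toy data: 8 of 10 correct when neutral, 7 of 10 when hostile.
    toy = [PairedResult(i < 8, i < 7) for i in range(10)]
    print(f"residual: {hostility_residual(toy):.2f}")
```

Pairing the two framings on the same items keeps item difficulty constant, so the difference reflects the framing and not the sample.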
- The effect is universal across architecture (Dense vs MoE) and quantization (FP16 vs Q4), indicating it is a fundamental property of current LLM training paradigms.
- Larger models like Mistral Large 123B show attenuation but remain significantly vulnerable, debunking the idea that simply adding parameters cures sensitivity.
- Instruction tuning actually amplifies hostility sensitivity in models like Llama 3.1, raising questions about how RLHF and safety training impact behavioral stability.
- The emergence of extreme position bias in specific configurations (like Mistral 7B Q4) under hostile framing suggests quantization can cause unpredictable distributional collapses.
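Position bias of the kind mentioned in the last point can be checked by looking at how often a model picks each answer slot. The sketch below assumes a multiple-choice setup with options A to D, which the summary does not confirm; `position_shares` and `max_position_share` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, Sequence


def position_shares(choices: Sequence[str], options: str = "ABCD") -> Dict[str, float]:
    """Share of responses that selected each option label.

    Responses outside `options` (for example refusals) count toward
    the total but get no share of their own.
    """
    if not choices:
        raise ValueError("no choices to analyse")
    counts = Counter(choices)
    return {label: counts.get(label, 0) / len(choices) for label in options}


def max_position_share(choices: Sequence[str], options: str = "ABCD") -> float:
    """Largest single-option share; near 1.0 means the picks collapsed onto one slot."""
    return max(position_shares(choices, options).values())


if __name__ == "__main__":
    # Toy data: a model that picks "A" three times out of four.
    print(position_shares(list("AAAB")))
    print(max_position_share(list("AAAB")))
```

If correct answers are spread evenly over four slots, an unbiased model's largest share should sit near 0.25, so comparing this value between neutral and hostile runs gives a rough signal of collapse.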
DISCOVERED
2026-04-24
PUBLISHED
2026-04-24
AUTHOR
Saraozte01