EvaluateAI Exposes Prompt Sensitivity Gaps
The maker of EvaluateAI ran the same math word problem, phrased in both a short and a long form, through Qwen 3.5, Qwen 3.6, Gemma 4, and IQ2, repeating each model-prompt combination 10 times. The results show that tiny prompt changes can flip outcomes as much as the choice of model can.
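As a rough sketch of that setup (the prompts, model identifiers, and `query_model` helper below are illustrative placeholders, not the author's actual harness), the collection loop might look like this in Python:

```python
# Sketch of the described setup: each model sees the same problem in a short
# and a long phrasing, and every (model, prompt) pair is run 10 times.
def query_model(model: str, prompt: str) -> str:
    # Hypothetical placeholder: in practice this would wrap whatever local
    # inference API is in use (llama.cpp server, Ollama, an OpenAI-compatible
    # endpoint, etc.). A canned answer keeps the sketch self-contained.
    return "The speed is 40 km/h."

MODELS = ["qwen-3.5", "qwen-3.6", "gemma-4"]  # placeholder identifiers
EXPECTED = "40"                               # placeholder expected answer
PROMPTS = {
    # Same underlying problem, phrased tersely vs. with narrative padding.
    "short": "A train covers 120 km in 3 hours. What is its speed in km/h?",
    "long": ("On a quiet morning, a commuter train pulled out of the station "
             "and, after exactly 3 hours of steady travel, had covered 120 km. "
             "What was its average speed in km/h?"),
}
RUNS = 10

results = {}  # (model, style) -> list of 0/1 correctness flags per repeat
for model in MODELS:
    for style, prompt in PROMPTS.items():
        flags = []
        for _ in range(RUNS):
            answer = query_model(model, prompt)
            flags.append(1 if EXPECTED in answer else 0)  # crude substring check
        results[(model, style)] = flags
```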
The main takeaway is not just that some models are better than others, but that “same task” does not mean “same prompt behavior.” A benchmark that ignores phrasing style can overrate one model and unfairly punish another.
- Qwen 3.6 looks less stable than 3.5 on this specific task, which is a reminder that newer releases can shift prompting behavior even when raw capability improves.
- Gemma 4 appears more tolerant of narrative context, while Qwen 3.6 seems more likely to collapse into the wrong interpretation under fluffier wording.
- Repeating each prompt 10 times matters; single-shot model comparisons hide variance and make the wrong failure mode look deterministic.
- This is a strong argument for evals that include multiple prompt styles, not just one “canonical” version; a sketch of how such results could be summarized follows this list.
- For local model testing, the lesson is practical: prompt engineering is model-specific, and the best prompt for one family can be the worst prompt for another.
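One way to read such a run off is a per-model summary of accuracy under each prompt style. The sketch below (continuing the hypothetical `results` dictionary from the loop above) prints the short-vs-long accuracy gap and the run-to-run spread:

```python
import statistics

def summarize(results: dict) -> None:
    """Per-model comparison of short- vs. long-prompt accuracy.

    `results` maps (model, style) -> list of 0/1 correctness flags,
    as built by the collection loop sketched earlier.
    """
    for model in sorted({m for m, _ in results}):
        short = results[(model, "short")]
        long_ = results[(model, "long")]
        acc_s, acc_l = statistics.mean(short), statistics.mean(long_)
        print(f"{model:10s} short={acc_s:.2f} long={acc_l:.2f} "
              f"delta={acc_s - acc_l:+.2f} "
              f"spread={statistics.pstdev(short + long_):.2f}")
```

A large delta flags a model whose answers flip with phrasing, and a nonzero spread flags run-to-run instability that a single sample per prompt would make look deterministic.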
Published: 2026-05-07 · Author: Excellent_Jelly2788