Creative Writing Benchmark Puts Ernie 5.1 Near Top
This GitHub benchmark evaluates short-fiction writing by having models respond to the same constrained creative briefs and then comparing the resulting stories head-to-head with evaluator LLMs. The latest leaderboard refresh adds Baidu Ernie 5.1, Qwen 3.7 Max, Mistral Medium 3.5, and Grok 4.3, with the reported scores placing Ernie 5.1 at -0.35, Qwen 3.7 Max at -2.01, Mistral Medium 3.5 at -2.13, and Grok 4.3 at -3.81. The benchmark also tracks compliance with the 600-800 word target range and measures how well stories incorporate the required elements.
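The 600-800 word compliance check described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and counting whitespace-separated tokens is an assumption, since the summary does not say how the benchmark counts words.

```python
def word_count(story: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a 'word')."""
    return len(story.split())


def within_target(story: str, low: int = 600, high: int = 800) -> bool:
    """True if the story length falls inside the inclusive target range."""
    return low <= word_count(story) <= high


def compliance_rate(stories: list[str], low: int = 600, high: int = 800) -> float:
    """Fraction of stories whose length is inside the target range."""
    if not stories:
        return 0.0
    return sum(within_target(s, low, high) for s in stories) / len(stories)
```

A per-model compliance rate computed this way can sit next to the pairwise score, so a model that wins comparisons while ignoring the length limit is easy to spot.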
Strong signal for model-eval nerds: this is a more realistic creative-writing benchmark than a flat rubric because it compares stories directly, but the ranking is still relative to this specific comparison graph.
- The headline result is the spread among the four newly added models: Ernie 5.1 (-0.35) scores well ahead of Qwen 3.7 Max (-2.01) and Mistral Medium 3.5 (-2.13), and especially Grok 4.3 (-3.81).
- Because scores are pairwise and relative, a small numeric gap (for example, the 0.12 between Qwen 3.7 Max and Mistral Medium 3.5) matters less than the comparison structure and the confidence intervals.
- The benchmark’s 600-800 word compliance check is useful context, since creative-writing quality here is tied to both form and content adherence.
- This is most relevant for teams evaluating model behavior on constrained story generation, instruction following, and stylistic coherence rather than factual QA.
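To make the "relative to this specific comparison graph" point concrete, here is a rough sketch of how pairwise wins can be turned into scores. The summary does not say which aggregation method the benchmark uses, so a Bradley-Terry fit is swapped in here purely as an assumption. The model names A to D and all match results are made-up toy data, not benchmark output.

```python
import math
from collections import Counter


def bradley_terry(results, iterations=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns log-strengths shifted to have zero mean, so every score
    is relative to the pool of models that was compared.
    """
    wins, losses, games = Counter(), Counter(), Counter()
    for winner, loser in results:
        wins[winner] += 1
        losses[loser] += 1
        games[frozenset((winner, loser))] += 1
    models = sorted(set(wins) | set(losses))
    # The plain fit has no finite solution for an unbeaten or winless model.
    if any(wins[m] == 0 or losses[m] == 0 for m in models):
        raise ValueError("every model needs at least one win and one loss")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # Standard minorization-maximization update for Bradley-Terry.
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}

    logs = {m: math.log(s) for m, s in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {m: value - mean for m, value in logs.items()}


def matchup(better, worse):
    """Toy data: 'better' wins three of four comparisons."""
    return [(better, worse)] * 3 + [(worse, better)]


three_models = matchup("A", "B") + matchup("B", "C") + matchup("A", "C")
scores = bradley_terry(three_models)

# Adding a weaker model D changes A's score even though A's results
# against B and C are untouched.
four_models = three_models + matchup("A", "D") + matchup("B", "D") + matchup("C", "D")
scores_with_d = bradley_terry(four_models)
```

In this toy setup, A's score rises once D joins the pool, because the zero point is the pool average. The same caution applies when reading leaderboard values: a score only has meaning against the set of models and matchups it was fitted on.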
DISCOVERED: 2026-05-26
PUBLISHED: 2026-05-26
AUTHOR: zero0_one1