Creative Writing Benchmark Puts Ernie 5.1 Near Top

// 45d ago · BENCHMARK RESULT

This GitHub-hosted benchmark evaluates short-fiction writing: every model responds to the same constrained creative briefs, and evaluator LLMs then compare the resulting stories head-to-head. The latest leaderboard refresh adds four models with these reported scores: Baidu Ernie 5.1 at -0.35, Qwen 3.7 Max at -2.01, Mistral Medium 3.5 at -2.13, and Grok 4.3 at -3.81. The benchmark also tracks compliance with the 600-800 word target range and measures how well stories incorporate the required elements.
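To make the compliance checks concrete, here is a minimal Python sketch of a word-range check and a crude required-element check. It is not the benchmark's own code: the function names are hypothetical, whitespace splitting is an assumed way of counting words, and the substring match is only a rough stand-in for however the benchmark actually judges element integration.

```python
def word_count(story: str) -> int:
    """Count whitespace-separated tokens (assumed counting rule)."""
    return len(story.split())


def in_target_range(story: str, low: int = 600, high: int = 800) -> bool:
    """True if the story length falls inside the inclusive target range."""
    return low <= word_count(story) <= high


def compliance_rate(stories: list[str], low: int = 600, high: int = 800) -> float:
    """Fraction of stories whose length is inside the target range."""
    if not stories:
        return 0.0
    hits = sum(in_target_range(s, low, high) for s in stories)
    return hits / len(stories)


def missing_elements(story: str, elements: list[str]) -> list[str]:
    """Required elements that never appear verbatim (case-insensitive).

    A literal substring test is a crude proxy: it cannot tell whether an
    element is woven into the plot or merely mentioned.
    """
    text = story.lower()
    return [e for e in elements if e.lower() not in text]
```

A length check like this only catches form violations; judging how well an element is used still needs a human or LLM grader.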

// ANALYSIS

Strong signal for model-eval nerds: this is a more realistic creative-writing benchmark than a flat rubric because it compares stories directly, but the ranking is still relative to this specific comparison graph, meaning the particular set of models and pairings that were judged.

– The headline result is the lower-tier spread: Ernie 5.1 (-0.35) holds up materially better than Qwen 3.7 Max (-2.01), Mistral Medium 3.5 (-2.13), and especially Grok 4.3 (-3.81).
– Because the score is pairwise and relative, small numeric gaps matter less than the comparison structure and confidence intervals. The 0.12-point gap between Qwen 3.7 Max and Mistral Medium 3.5, for example, should not be read as a decisive ordering on its own.
– The benchmark's 600-800 word compliance check is useful context, since creative-writing quality here is tied to both form and content adherence.
– This is most relevant for teams evaluating model behavior on long-form generation, instruction following, and stylistic coherence rather than factual QA.
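To illustrate why pairwise scores are only meaningful relative to each other, here is a minimal sketch that fits a Bradley-Terry model to head-to-head win counts. This is an assumption for illustration only: the summary above does not say which rating method the benchmark uses, and the model names and win counts below are made up.

```python
import math
from collections import defaultdict


def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry log-strengths with the classic iterative (MM) update.

    wins[(a, b)] is how many head-to-head comparisons a won against b.
    Assumes every model has at least one win and one loss; otherwise the
    estimate is unbounded. Returned scores are centered to average zero.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # symmetric count of comparisons per pair
    for (a, b), n in wins.items():
        total_wins[a] += n
        games[(a, b)] += n
        games[(b, a)] += n

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            new_p[i] = total_wins[i] / denom
        # Rescale so the geometric mean is 1 (log-strengths average to zero).
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return {m: math.log(p[m]) for m in models}


def win_probability(score_a: float, score_b: float) -> float:
    """Modeled chance that a beats b; depends only on the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Hypothetical tallies, not data from the benchmark.
example_wins = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
}
example_scores = bradley_terry(example_wins)
```

Under this kind of model, adding a constant to every score changes nothing, so a negative score only says a model sits below the pool average for that comparison set. Adding or removing models can shift every number without any story changing.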

// TAGS

llm, benchmark, creative-writing, story-generation, evaluation, pairwise-comparison, github, ai-models

DISCOVERED

45d ago

2026-05-26

PUBLISHED

45d ago

2026-05-26

RELEVANCE

9/10

AUTHOR

zero0_one1
