OPEN_SOURCE
// INFRASTRUCTURE
Cloudflare makes Kimi K2.5 3x faster
Cloudflare says Workers AI now serves Moonshot’s Kimi K2.5 at production scale, and that a stack of inference optimizations made it roughly 3x faster. The launch positions Kimi as the first large model on Workers AI and pairs it with platform upgrades like custom kernels, prefix caching, session affinity, and async inference.
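The async inference mentioned above generally follows a submit-then-poll shape: the client gets a request id back immediately and fetches the result later. A minimal in-process sketch of that pattern (the class, method names, and fields are illustrative assumptions, not Cloudflare's actual Workers AI API):

```python
# Hypothetical sketch of a submit-then-poll async inference queue,
# simulated in-process. All names here are illustrative, not Cloudflare's.
import uuid


class AsyncInferenceQueue:
    def __init__(self):
        self.jobs: dict[str, dict] = {}

    def submit(self, prompt: str) -> str:
        """Enqueue a request and return a request id immediately."""
        request_id = str(uuid.uuid4())
        self.jobs[request_id] = {"status": "queued", "prompt": prompt, "result": None}
        return request_id

    def run_pending(self):
        """Worker loop body: drain queued jobs (the model call is stubbed)."""
        for job in self.jobs.values():
            if job["status"] == "queued":
                job["result"] = f"analysis of: {job['prompt']}"
                job["status"] = "complete"

    def poll(self, request_id: str) -> dict:
        """Check on a previously submitted request."""
        job = self.jobs[request_id]
        return {"status": job["status"], "result": job["result"]}


queue = AsyncInferenceQueue()
rid = queue.submit("scan repo for injection bugs")
first = queue.poll(rid)      # not ready yet: status is "queued"
queue.run_pending()          # worker processes the backlog
done = queue.poll(rid)       # status is now "complete"
```

The point of the shape is that slow, batch-friendly workloads (code scanning, research agents) never hold a connection open while waiting on GPU capacity.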
// ANALYSIS
This is the real story behind “fast AI”: the model matters, but the serving stack is where most of the leverage lives. Cloudflare is showing that frontier open models are becoming an infrastructure problem, not just a benchmark problem.
- Custom kernels and disaggregated prefill are the kind of low-level wins that most teams cannot reproduce on their own
- `x-session-affinity` is a smart way to convert repeated agent context into cache hits, lower TTFT, and lower token spend
- The async API is the right fit for code scanning and research agents where reliability matters more than immediate response
- The 77% cost reduction claim is the strongest signal here: open weights only become operationally relevant when the serving economics work
- For teams self-hosting Kimi, this is a reminder that “out of the box” throughput is usually leaving money on the table
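The session-affinity point above is worth making concrete: if an agent keeps the same `x-session-affinity` value across turns, the server can route it to the same node and reuse the KV state for the shared prompt prefix, so only the new suffix needs prefill. A toy sketch of that mechanism (the class and the cache bookkeeping are assumptions for illustration; only the header name comes from the source):

```python
# Hypothetical sketch: a session-affinity header turning repeated agent
# context into prefix-cache hits. Everything except the header name is
# illustrative, not Cloudflare's implementation.
class PrefixCachingServer:
    def __init__(self):
        # per-session record of the longest prompt already processed
        self.cache: dict[str, str] = {}

    def infer(self, headers: dict[str, str], prompt: str) -> dict:
        session = headers.get("x-session-affinity")
        cached = self.cache.get(session, "") if session else ""
        if prompt.startswith(cached):
            # cached prefix is reusable: prefill only the new suffix
            reused, new_tokens = len(cached), prompt[len(cached):]
        else:
            # prefix diverged (or no session): recompute everything
            reused, new_tokens = 0, prompt
        if session:
            self.cache[session] = prompt
        return {"prefilled_from_cache": reused, "prefill_work": len(new_tokens)}


server = PrefixCachingServer()
ctx = "SYSTEM: you are a code-review agent.\n"
r1 = server.infer({"x-session-affinity": "agent-42"}, ctx + "Review file A")
r2 = server.infer({"x-session-affinity": "agent-42"},
                  ctx + "Review file A" + "\nReview file B")
# r2 reuses the cached turn-1 prefix, so prefill covers only the new line
```

Lower prefill work is exactly where the claimed TTFT and token-cost wins come from: the expensive part of each turn shrinks to the delta since the last request.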
// TAGS
kimi-k2-5 · inference · gpu · cloud · agent · llm
DISCOVERED
3h ago
2026-04-16
PUBLISHED
4h ago
2026-04-16
RELEVANCE
9/10
AUTHOR
dok2001