OPEN_SOURCE
REDDIT // 37d ago · INFRASTRUCTURE
llama.cpp boosts Kimi Linear throughput, context
A merged llama.cpp pull request changes Kimi Linear's chunk size to 16, improving prompt-processing throughput by roughly 15% in the posted benchmarks while reducing CUDA compute-buffer usage enough to fit a longer context on an RTX 3090. The Reddit post claims even larger real-world gains, framing it as a tiny code change with outsized impact for long-context local inference.
// ANALYSIS
This is the kind of unglamorous systems work that actually moves local LLMs forward: better throughput and more context without buying new hardware.
- The biggest win is on prompt processing, which matters most when you are stuffing huge contexts into local models
- The PR's VRAM savings are notable because less compute-buffer overhead directly translates into more usable context length on consumer GPUs
- llama.cpp keeps proving that model-specific optimizations can materially improve inference even after a model is already "supported"
- The headline 30% figure looks like an upper-end community report; the merged PR itself shows a more conservative but still meaningful ~15% benchmark gain
- For LocalLLaMA users, this is a practical reminder that upstream merges can change what quants and context lengths are viable on the same card
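For readers who want to check claims like these on their own hardware, llama.cpp ships a `llama-bench` tool that reports prompt-processing (pp) and token-generation (tg) throughput. A rough sketch, where the model filename, prompt length, and context size are illustrative placeholders rather than values from the post:

```shell
# Measure prompt-processing throughput before and after pulling the update.
# Model path and sizes below are hypothetical examples, not from the PR.
./llama-bench -m kimi-linear.gguf -p 8192 -n 128 -ngl 99

# Then probe whether the reduced CUDA compute-buffer usage lets a larger
# context (-c) fit on the same card without an out-of-memory failure.
./llama-cli -m kimi-linear.gguf -c 65536 -ngl 99
```

Comparing the pp tokens/s figure across builds, and raising `-c` until allocation fails, is the practical way to see whether the merge changes what is viable on a given GPU.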
// TAGS
llama-cpp · inference · gpu · open-source · llm
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
RELEVANCE
8/10
AUTHOR
Ok_Warning2146