// BENCHMARK RESULT

llama.cpp-mtp hits 80+ tok/s at 262K context

A custom llama.cpp fork combines multi-token prediction (MTP) with TurboQuant's TBQ4_0 KV cache compression on Qwen3.6-27B. The author reports throughput improving from about 43 tok/s to 80-87 tok/s with roughly 73% draft acceptance, on a single RTX 4090 under Ubuntu 24.04 and CUDA 12.x.
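
For a sense of why the KV compression is the enabler at this context length, a back-of-envelope sizing sketch helps. The post doesn't publish Qwen3.6-27B's layer or head counts, so the dimensions below are assumptions chosen to be plausible for a ~27B grouped-query model, and 4.5 bits/value approximates a q4_0-style block layout (32 4-bit values plus an fp16 scale per block), not TBQ4_0's actual format:

    # Back-of-envelope KV-cache sizing. ASSUMED dimensions: the post does
    # not publish Qwen3.6-27B's architecture, so these are plausible
    # grouped-query values for a ~27B model, not the real config.
    N_LAYERS = 48      # assumed transformer layer count
    N_KV_HEADS = 8     # assumed KV heads (GQA)
    HEAD_DIM = 128     # assumed per-head dimension
    CTX = 262_144      # 262K context from the headline

    def kv_cache_gib(bits_per_value: float) -> float:
        # Total K and V values across all layers at full context length.
        values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX
        return values * bits_per_value / 8 / 2**30

    print(f"f16 cache:     {kv_cache_gib(16.0):5.1f} GiB")  # ~48.0 GiB
    print(f"4.5-bit cache: {kv_cache_gib(4.5):5.1f} GiB")   # ~13.5 GiB

Even under these assumptions, a full-precision cache at 262K context far exceeds the 4090's 24 GB of VRAM, so an aggressively quantized cache format is what makes the headline context length reachable at all.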

// ANALYSIS

Strong hobbyist benchmark and a useful signal for local-LLM enthusiasts, but it reads more like a performance experiment than a polished product launch.

  • The headline result is the combination: long context, MTP, and TurboQuant KV compression on consumer hardware.
  • The claimed speedup is meaningful, especially if the 80+ tok/s figure reproduces outside the author’s machine (a rough consistency check follows this list).
  • The setup is highly specialized: forked runtime, grafted MTP heads, and a specific model/quantization stack.
  • Quality claims are still anecdotal; independent reproduction would matter before treating this as a generally reliable recipe.
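
On that speedup claim, the standard speculative-decoding expectation relates acceptance rate to tokens emitted per verification pass. A minimal sketch, assuming the 73% figure is a per-token acceptance probability and that the MTP heads draft k tokens per pass (the post states neither):

    # Consistency check for ~73% acceptance vs the ~2x throughput gain.
    # ASSUMPTIONS: 0.73 is a per-token acceptance probability and the MTP
    # heads draft k tokens per verification pass; the post states neither.

    def expected_tokens(p: float, k: int) -> float:
        # Expected tokens per target pass: sum(p^i for i in 0..k),
        # i.e. accepted draft tokens plus the guaranteed bonus token.
        return (1 - p ** (k + 1)) / (1 - p)

    BASE_TPS = 43.0  # reported non-speculative baseline
    for k in (1, 2, 3):
        e = expected_tokens(0.73, k)
        print(f"k={k}: {e:.2f} tok/pass -> <= {BASE_TPS * e:.0f} tok/s")
    # k=1 -> ~74 tok/s, k=2 -> ~97 tok/s, k=3 -> ~114 tok/s

Ignoring draft and verification overhead, these are upper bounds; the reported 80-87 tok/s landing between the k=1 and k=2 bounds makes the acceptance and throughput figures at least mutually plausible.
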
// TAGS
llama-cpp-mtp, mtp, turboquant, qwen3-6, kv-cache, speculative-decoding, local-first, benchmark, rtx-4090

DISCOVERED: 4h ago (2026-05-09)
PUBLISHED: 7h ago (2026-05-08)
RELEVANCE: 8/10
AUTHOR: indrasmirror