llama.cpp-mtp hits 80+ tok/s at 262K context
A custom llama.cpp fork combines multi-token prediction (MTP) with TurboQuant's TBQ4_0 KV cache compression on Qwen3.6-27B. The author reports improving throughput from about 43 tok/s to 80-87 tok/s with roughly 73% draft acceptance on a single RTX 4090 under Ubuntu 24.04 and CUDA 12.x.
Strong hobbyist benchmark and a useful signal for local-LLM enthusiasts, but it reads more like a performance experiment than a polished product launch.
- The headline result is the combination: long context, MTP, and TurboQuant KV compression on consumer hardware.
- The claimed speedup is meaningful, especially if the 80+ tok/s figure reproduces outside the author's machine.
- The setup is highly specialized: forked runtime, grafted MTP heads, and a specific model/quantization stack.
- Quality claims are still anecdotal; independent reproduction would matter before treating this as a generally reliable recipe.
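The reported numbers are roughly consistent with the standard speculative-decoding arithmetic. A minimal sketch, assuming i.i.d. per-token acceptance and a hypothetical 2-token MTP draft length (the article only reports the ~73% acceptance rate, not the draft depth):

```python
# Sketch: expected tokens emitted per target-model forward pass under the
# usual speculative-decoding model with i.i.d. per-token acceptance.
# Draft length k here is an assumption, not a figure from the article.

def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification step: (1 - a^(k+1)) / (1 - a)."""
    a = accept_rate
    return (1 - a ** (draft_len + 1)) / (1 - a)

# With the reported ~73% acceptance and an assumed 2-token draft:
tokens_per_step = expected_tokens_per_step(0.73, 2)
print(f"{tokens_per_step:.2f} tokens per target forward pass")
```

This yields roughly 2.3 tokens per target-model pass, in the same ballpark as the reported ~43 to 80-87 tok/s improvement (about a 1.9-2x speedup), ignoring the overhead of running the draft heads themselves.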
DISCOVERED: 2026-05-09 (4h ago)
PUBLISHED: 2026-05-08 (7h ago)
AUTHOR: indrasmirror