OPEN_SOURCE
REDDIT · 3h ago
// BENCHMARK RESULT
TurboQuant hits 40 tok/s on 3080
TheTom's llama.cpp TurboQuant fork pairs turbo3 KV-cache compression with CUDA offload to run Qwen3.6-35B-A3B at roughly 40 tokens/s on a 12GB RTX 3080, even at 260K context. The post positions it as a practical long-context local inference setup rather than a pure benchmark flex.
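Reproducing this depends on the fork's build, but the general shape of a long-context llama.cpp launch is sketched below. The binary name, model filename, and the `turbo3` cache-type value are assumptions inferred from the post; the remaining flags (`-m`, `-ngl`, `-c`, `-fa`, `--cache-type-k/-v`) are standard llama.cpp options.

```shell
# Hypothetical launch against TheTom's TurboQuant fork.
# Binary and "turbo3" cache type are assumptions; other flags are stock llama.cpp.
./llama-cli \
  -m qwen3.6-35b-a3b.Q4_K_M.gguf \
  -ngl 99 \
  -c 266240 \
  -fa \
  --cache-type-k turbo3 \
  --cache-type-v turbo3
# -m: Q4_K_M weights, as in the post; -ngl 99: offload all layers to the GPU;
# -c 266240: ~260K-token context window; -fa: flash attention;
# --cache-type-k/-v: KV-cache storage format ("turbo3" is the fork's assumed
# compression type -- upstream llama.cpp ships types like q8_0 and q4_0).
```

Stock llama.cpp already supports quantized KV-cache types via these flags; the fork's contribution, per the post, is a more aggressive compression scheme on top.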
// ANALYSIS
This is less about a single speed number and more about shifting the VRAM frontier for local 30B-plus models. TurboQuant matters here because it makes very long context usable on consumer hardware that would normally choke on the KV cache.
- turbo3 KV-cache compression appears to be the main unlock, since 260K context is usually the first thing that breaks on 12GB cards
- The result depends on a heavily tuned llama.cpp fork plus CUDA build flags, flash attention, and Q4_K_M weights, so it is not a drop-in win
- The practical value is for agentic workflows: faster ask-validate-review-refine loops matter more than isolated token throughput
- This is a strong signal that memory efficiency is now as important as raw kernel speed for local inference
- If pieces of this land upstream, Qwen3.x A3B-class models become far more viable on midrange GPUs
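The first bullet can be made concrete with back-of-envelope KV-cache arithmetic. The post does not give Qwen3.6-35B-A3B's dimensions, so the layer and head counts below are assumptions borrowed from similar A3B-class MoE configs; the point is the order of magnitude, not the exact figure.

```python
# Back-of-envelope KV-cache size at 260K context.
# ASSUMED model dims (not from the post): 48 layers, 4 KV heads (GQA),
# head dim 128, fp16 cache (2 bytes per element).
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2
ctx = 260_000

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = bytes_per_token * ctx / 2**30

print(f"{bytes_per_token} bytes/token -> {total_gib:.1f} GiB at {ctx:,} tokens")
# → 98304 bytes/token -> 23.8 GiB at 260,000 tokens
```

Under these assumptions an uncompressed fp16 cache alone would need roughly 24 GiB, about double a 12GB 3080, before counting the Q4_K_M weights, which is why KV-cache compression rather than kernel speed is the unlock here.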
// TAGS
llama-cpp-turboquant · turboquant · llm · gpu · inference · benchmark · open-source
DISCOVERED
3h ago (2026-04-17)
PUBLISHED
7h ago (2026-04-16)
RELEVANCE
8 / 10
AUTHOR
herpnderpler