REDDIT // 3h ago // BENCHMARK RESULT

TurboQuant hits 40 tok/s on 3080

TheTom's TurboQuant fork of llama.cpp pairs turbo3 KV-cache compression with CUDA offload to run Qwen3.6-35B-A3B at roughly 40 tokens/s on a 12GB RTX 3080, even at 260K context. The post frames this as a practical long-context local-inference setup rather than a pure benchmark flex.
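
Some rough numbers show why the KV cache is the bottleneck here. The post doesn't give the model's exact architecture, so this sketch assumes Qwen3-30B-A3B-class dimensions (48 layers, 4 GQA KV heads, head_dim 128) as a stand-in:

```python
# Back-of-envelope KV-cache sizing. The layer/head/dim values below are
# assumptions borrowed from Qwen3-30B-A3B-class models, not confirmed
# specs for Qwen3.6-35B-A3B.
n_layers, n_kv_heads, head_dim = 48, 4, 128
ctx = 260_000

def kv_cache_gib(bits_per_elem: float) -> float:
    # 2x for the K and V tensors, one slot per layer, KV head, and position.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * bits_per_elem / 8 / 2**30

print(f"fp16 KV cache:   {kv_cache_gib(16):.1f} GiB")  # ~23.8 GiB
print(f"~3-bit KV cache: {kv_cache_gib(3):.1f} GiB")   # ~4.5 GiB
```

Under these assumptions, a full-precision cache at this length is roughly double the card's entire VRAM before any weights are loaded, which is why the compression, not the raw 40 tok/s, is the story.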

// ANALYSIS

This is less about a single speed number and more about shifting the VRAM frontier for local 30B-plus models. TurboQuant matters here because it makes very long context usable on consumer hardware that would normally choke on the KV cache.
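
For intuition about what KV-cache compression means mechanically, here is a generic groupwise-quantization sketch in numpy. To be clear, turbo3's actual scheme is not described in the post; this only illustrates the family of technique (low-bit integers plus a per-group scale), with the bit width and group size as illustrative assumptions:

```python
import numpy as np

# Generic groupwise low-bit quantization -- NOT turbo3's real scheme.
# Real kernels would bit-pack the integers rather than spend an int8 each.
def quantize_groups(kv, bits=3, group=32):
    x = kv.reshape(-1, group).astype(np.float32)
    qmax = 2 ** (bits - 1) - 1                           # 3 for 3-bit symmetric
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax + 1e-8
    q = np.round(x / scale).clip(-qmax - 1, qmax)
    return q.astype(np.int8), scale.astype(np.float16)

def dequantize_groups(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

kv = np.random.randn(4, 4096).astype(np.float16)         # toy K block
q, s = quantize_groups(kv)
recon = dequantize_groups(q, s, kv.shape)
print(f"mean abs reconstruction error: {np.abs(recon - kv).mean():.4f}")
```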

  • turbo3 KV-cache compression appears to be the main unlock, since 260K context is usually the first thing that breaks on 12GB cards
  • The result depends on a heavily tuned llama.cpp fork plus CUDA build flags, flash attention, and Q4_K_M weights, so it is not a drop-in win (see the config sketch after this list)
  • The practical value is for agentic workflows: faster ask → validate → review → refine loops matter more than isolated token throughput
  • This is a strong signal that memory efficiency is now as important as raw kernel speed for local inference
  • If pieces of this land upstream, Qwen3.x A3B-class models become far more viable on midrange GPUs
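
Stock llama.cpp already exposes the knobs the fork tunes, minus turbo3 itself. Here is a hedged sketch via llama-cpp-python: the GGUF filename is a placeholder, q4_0 is merely the closest upstream KV-cache type, and nothing here reproduces the fork's 260K-on-12GB result:

```python
# Stock-llama.cpp analog of the setup, via llama-cpp-python.
# "turbo3" is TurboQuant's own KV type with no upstream equivalent;
# this shows the knobs involved, not the reported benchmark.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="qwen3.6-35b-a3b-q4_k_m.gguf",   # hypothetical filename
    n_gpu_layers=-1,                  # offload every layer CUDA will take
    n_ctx=260_000,                    # the long-context target from the post
    flash_attn=True,                  # upstream requires this for quantized V
    type_k=llama_cpp.GGML_TYPE_Q4_0,  # quantized K cache
    type_v=llama_cpp.GGML_TYPE_Q4_0,  # quantized V cache
)
print(llm("Summarize this repo:", max_tokens=64)["choices"][0]["text"])
```

Note that upstream llama.cpp only allows a quantized V cache when flash attention is enabled, which is presumably why the fork's recipe bundles the two.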
// TAGS
llama-cpp-turboquant · turboquant · llm · gpu · inference · benchmark · open-source

DISCOVERED: 3h ago (2026-04-17)
PUBLISHED: 7h ago (2026-04-16)
RELEVANCE: 8/10
AUTHOR: herpnderpler