TurboQuant hits 40 tok/s on 3080

// BENCHMARK RESULT, 45d ago

TheTom's llama.cpp TurboQuant fork pairs turbo3 KV-cache compression with CUDA offload to run Qwen3.6-35B-A3B at roughly 40 tokens/s on a 12GB RTX 3080, even at 260K context. The post positions it as a practical long-context local inference setup rather than a pure benchmark flex.

// ANALYSIS

This is less about a single speed number and more about shifting the VRAM frontier for local 30B-plus models. TurboQuant matters here because it makes very long context usable on consumer hardware that would normally choke on the KV cache.

  • turbo3 KV-cache compression appears to be the main unlock, since 260K context is usually the first thing that breaks on 12GB cards
  • The result depends on a heavily tuned llama.cpp fork plus CUDA build flags, flash attention, and Q4_K_M weights, so it is not a drop-in win
  • The practical value is for agentic workflows: a faster ask-validate-review-refine loop matters more than isolated token throughput
  • This is a strong signal that memory efficiency is now as important as raw kernel speed for local inference
  • If pieces of this land upstream, Qwen3.x A3B-class models become far more viable on midrange GPUs
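
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV-cache size scales with context length and bits per cached value. The model shape below (layer count, KV heads, head dimension) is a hypothetical placeholder, not the actual Qwen3.6-35B-A3B configuration, and treating turbo3 as roughly 3 bits per value is an assumption based only on its name. Real quantized caches also carry per-block scale overhead that this estimate ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV-cache size in bytes.

    Each token stores one K and one V vector (hence the factor 2)
    per layer and per KV head, each of length head_dim.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical shape for illustration only; substitute the real
    # values from the model's config before drawing conclusions.
    shape = dict(n_layers=40, n_kv_heads=4, head_dim=128)
    n_ctx = 260_000  # context length quoted in the post

    # "~3-bit" is an assumed stand-in for a turbo3-style cache format.
    for label, bits in [("f16", 16), ("8-bit", 8), ("~3-bit", 3)]:
        gib = kv_cache_bytes(**shape, n_ctx=n_ctx, bits_per_value=bits) / 2**30
        print(f"{label:>6}: {gib:.1f} GiB")
```

The cache grows linearly in both context length and bits per value, so at a fixed context, dropping from 16 bits to about 3 bits shrinks it by roughly a factor of five. That is the kind of margin that decides whether a long context fits next to the weights on a 12GB card.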
// TAGS
llama-cpp-turboquant, turboquant, llm, gpu, inference, benchmark, open-source

DISCOVERED: 45d ago (2026-04-17)

PUBLISHED: 45d ago (2026-04-16)

RELEVANCE: 8/10

AUTHOR: herpnderpler