OPEN_SOURCE
REDDIT · 3h ago
// BENCHMARK RESULT
TurboQuant hits 40 tok/s on 3080
TheTom's llama.cpp TurboQuant fork pairs turbo3 KV-cache compression with CUDA offload to run Qwen3.6-35B-A3B at roughly 40 tokens/s on a 12GB RTX 3080, even at 260K context. The post positions it as a practical long-context local inference setup rather than a pure benchmark flex.
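Reproducing this depends on the fork's build, but the general shape of a long-context llama.cpp launch is sketched below. The binary name, model filename, and the `turbo3` cache-type value are assumptions inferred from the post; the remaining flags (`-m`, `-ngl`, `-c`, `-fa`, `--cache-type-k/-v`) are standard llama.cpp options.

```shell
# Hypothetical launch against TheTom's TurboQuant fork.
# Binary and "turbo3" cache type are assumptions; other flags are stock llama.cpp.
./llama-cli \
  -m qwen3.6-35b-a3b.Q4_K_M.gguf \
  -ngl 99 \
  -c 266240 \
  -fa \
  --cache-type-k turbo3 \
  --cache-type-v turbo3
# -m: Q4_K_M weights, as in the post; -ngl 99: offload all layers to the GPU;
# -c 266240: ~260K-token context window; -fa: flash attention;
# --cache-type-k/-v: KV-cache storage format ("turbo3" is the fork's assumed
# compression type -- upstream llama.cpp ships types like q8_0 and q4_0).
```

Stock llama.cpp already supports quantized KV-cache types via these flags; the fork's contribution, per the post, is a more aggressive compression scheme on top.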
// ANALYSIS
This is less about a single speed number and more about shifting the VRAM frontier for local 30B-plus models. TurboQuant matters here because it makes very long context usable on consumer hardware that would normally choke on the KV cache.
- turbo3 KV-cache compression appears to be the main unlock, since 260K context is usually the first thing that breaks on 12GB cards
- The result depends on a heavily tuned llama.cpp fork plus CUDA build flags, flash attention, and Q4_K_M weights, so it is not a drop-in win
- The practical value is for agentic workflows: faster ask-validate-review-refine loops matter more than isolated token throughput
- This is a strong signal that memory efficiency is now as important as raw kernel speed for local inference
- If pieces of this land upstream, Qwen3.x A3B-class models become far more viable on midrange GPUs
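The first bullet can be made concrete with back-of-envelope KV-cache arithmetic. The post does not give Qwen3.6-35B-A3B's dimensions, so the layer and head counts below are assumptions borrowed from similar A3B-class MoE configs; the point is the order of magnitude, not the exact figure.

```python
# Back-of-envelope KV-cache size at 260K context.
# ASSUMED model dims (not from the post): 48 layers, 4 KV heads (GQA),
# head dim 128, fp16 cache (2 bytes per element).
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2
ctx = 260_000

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = bytes_per_token * ctx / 2**30

print(f"{bytes_per_token} bytes/token -> {total_gib:.1f} GiB at {ctx:,} tokens")
# → 98304 bytes/token -> 23.8 GiB at 260,000 tokens
```

Under these assumptions an uncompressed fp16 cache alone would need roughly 24 GiB, about double a 12GB 3080, before counting the Q4_K_M weights, which is why KV-cache compression rather than kernel speed is the unlock here.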
// TAGS
llama-cpp-turboquant · turboquant · llm · gpu · inference · benchmark · open-source
DISCOVERED
3h ago (2026-04-17)
PUBLISHED
7h ago (2026-04-16)
RELEVANCE
8 / 10
AUTHOR
herpnderpler