Claude Code backend chases local GPU speed
A LocalLLaMA thread asks whether any local model/GPU stack can approach 500 tokens/sec for a Claude Code backend. Replies say that target is unrealistic for 35B-class models on consumer hardware, and that perceived speed depends more on first-token latency, tool-call consistency, and context handling than raw throughput.
The thread’s real takeaway is that agent UX lives or dies on latency, not benchmark bragging rights. If Claude Code is the target, you optimize the whole loop: prompt load, context compression, model parallelism, and tool round trips.
- 500 tokens/sec for a 35B model is described as data-center territory, not a normal consumer-GPU outcome
- One commenter says Qwen 3.5 35B A3B on a single 4090 is around 60 tokens/sec, but TTFT on 16K context is still about 2 seconds
- Dual-GPU setups or smaller A3B/MoE variants with aggressive compression look more practical than chasing raw decode speed alone
- Speculative decoding in llama.cpp is called out as a meaningful multiplier, with one commenter claiming roughly 1.5-4x speedups on some models
- The hardware ceiling discussion points toward very high-bandwidth cards like RTX 5090 or RTX Pro 6000, plus models such as Nemotron-3-Nano or GLM-4.7-Flash
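To make the latency argument concrete, here is a rough back-of-the-envelope sketch of one agent turn, modeled as TTFT plus decode time. The 2-second TTFT and 60 tokens/sec figures come from the thread; the 120-token reply length, the 2x speculative-decoding multiplier, and the 500 tokens/sec backend that keeps the same TTFT are illustrative assumptions, and the names `Backend` and `turn_latency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    ttft_s: float        # time to first token, in seconds
    decode_tok_s: float  # steady-state decode throughput, in tokens/sec


def turn_latency(backend: Backend, output_tokens: int, speedup: float = 1.0) -> float:
    """Seconds for one model turn: TTFT plus time to decode the reply.

    `speedup` is an assumed multiplier on decode throughput (for example
    from speculative decoding). In this simple model it does not reduce TTFT.
    """
    return backend.ttft_s + output_tokens / (backend.decode_tok_s * speedup)


if __name__ == "__main__":
    reply_tokens = 120  # assumed size of a short tool-call reply

    # Figures reported in the thread for a single 4090 at 16K context.
    local = Backend("single 4090", ttft_s=2.0, decode_tok_s=60.0)
    # Hypothetical backend hitting the 500 tokens/sec target with the same TTFT.
    target = Backend("500 tok/s target", ttft_s=2.0, decode_tok_s=500.0)

    print(f"{local.name}: {turn_latency(local, reply_tokens):.2f} s")
    print(f"{local.name} + 2x speculative: {turn_latency(local, reply_tokens, 2.0):.2f} s")
    print(f"{target.name}: {turn_latency(target, reply_tokens):.2f} s")
```

Under these assumptions the turn drops from 4 seconds to about 2.2 seconds at 500 tokens/sec, and the remaining time is almost entirely TTFT. For short tool-call replies, cutting prompt-processing time matters at least as much as raw decode speed.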
Discovered: 2026-04-06
Published: 2026-04-06
Author: SwordfishGreat4532

