OPEN_SOURCE
REDDIT // 6d ago · INFRASTRUCTURE
Claude Code backend chases local GPU speed
A LocalLLaMA thread asks whether any local model/GPU stack can approach 500 tokens/sec for a Claude Code backend. Replies say that target is unrealistic for 35B-class models on consumer hardware, and that perceived speed depends more on first-token latency, tool-call consistency, and context handling than raw throughput.
// ANALYSIS
The thread’s real takeaway is that agent UX lives or dies on latency, not benchmark bragging rights. If Claude Code is the target, you optimize the whole loop: prompt load, context compression, model parallelism, and tool round trips.
- 500 tokens/sec for a 35B model is described as data-center territory, not a normal consumer-GPU outcome
- One commenter says Qwen 3.5 35B A3B on a single 4090 is around 60 tokens/sec, but TTFT on 16K context is still about 2 seconds
- Dual-GPU setups or smaller A3B/MoE variants with aggressive compression look more practical than chasing raw decode speed alone
- Speculative decoding in llama.cpp is called out as a meaningful multiplier, with one commenter claiming roughly 1.5-4x speedups on some models
- The hardware ceiling discussion points toward very high-bandwidth cards like RTX 5090 or RTX Pro 6000, plus models such as Nemotron-3-Nano or GLM-4.7-Flash
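The thread's TTFT-versus-throughput point can be sanity-checked with simple arithmetic: wall-clock time for a full response is first-token latency plus decode time. A minimal sketch, using the 60 tok/s / ~2 s TTFT figures reported in the thread and the 500 tok/s target (the completion length is an assumption for illustration):

```python
# Sketch: end-to-end response time = TTFT + tokens / decode rate.
# 60 tok/s and ~2 s TTFT come from the thread's 4090 data point;
# the 500 tok/s row is the aspirational target with the same TTFT,
# and n_tokens = 600 is an assumed completion length.

def response_time(ttft_s: float, decode_tps: float, n_tokens: int) -> float:
    """Seconds until the full response has been decoded."""
    return ttft_s + n_tokens / decode_tps

n = 600
print(f"60 tok/s:  {response_time(2.0, 60.0, n):.1f} s")   # ~12.0 s
print(f"500 tok/s: {response_time(2.0, 500.0, n):.1f} s")  # ~3.2 s
```

Note that at 500 tok/s the 2-second TTFT already dominates the total, which is the thread's core argument: past a point, shaving first-token latency and tool round trips matters more than raw decode speed.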
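The claimed 1.5-4x from speculative decoding is consistent with the standard expected-speedup model: if a draft model proposes gamma tokens and the target model accepts each with probability alpha, the expected tokens produced per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha) (Leviathan et al.). A rough sketch, ignoring draft-model cost; the alpha values are illustrative assumptions, not measurements:

```python
# Sketch of the speculative-decoding speedup model: expected accepted
# tokens (plus one correction token) per target-model forward pass,
# given draft length gamma and per-token acceptance rate alpha.
# Real-world gains are lower once draft-model compute is included.

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """E[tokens per target pass] = (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 2))
# alpha 0.5 -> ~1.9x, 0.7 -> ~2.8x, 0.9 -> ~4.1x per pass
```

The output range (~1.9x to ~4.1x across plausible acceptance rates) lines up with the commenter's claimed 1.5-4x, and explains why the multiplier varies so much between models: it depends on how well the draft model predicts the target.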
// TAGS
claude-code · inference · gpu · cli · agent · llm
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
RELEVANCE
8/10
AUTHOR
SwordfishGreat4532