OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT
Qwen3.5-35B-A3B hits 60 tok/s on 4060 Ti
A LocalLLaMA user reports tuning llama.cpp to run Unsloth’s Qwen3.5-35B-A3B GGUF at 64k context on an RTX 4060 Ti 16GB, with real-world throughput of roughly 40–60 tok/s. The post focuses on the configuration shape that made the difference: a unified KV cache, batch sizing, offloading MoE expert weights to CPU, and keeping VRAM pressure under control.
// ANALYSIS
This is a useful reality check for local LLM inference: the bottleneck is often runtime configuration, not just GPU size.
- The win appears to come from the effective runtime shape, not a single magic flag, which is why the author emphasizes `n_parallel`, `kv_unified`, `n_ctx_seq`, `n_ctx_slot`, `n_batch`, and `n_ubatch`.
- The reported numbers come from actual continuation and long-context runs, so they are more credible than a one-off prompt benchmark.
- A 16GB consumer card handling a 35B-class MoE model at 64k context makes local deployment feel practical for serious dev workflows, not just hobby demos.
- The post also points to a gap in the ecosystem: people need shared, hardware-specific config baselines instead of rediscovering the same tuning tricks.
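The settings named above map onto llama-server command-line flags. A minimal sketch of what such a launch might look like; the model filename and every numeric value here are assumptions for illustration, not the author's actual config:

```shell
# Hypothetical llama-server launch in the shape the post describes.
# Filename and all values are placeholders, not the poster's exact settings.

# --ctx-size     total context (n_ctx); with --parallel 1 the single slot
#                gets the full 64k (n_ctx_seq / n_ctx_slot follow from this)
# --kv-unified   one unified KV cache instead of per-slot splits
# --batch-size   n_batch, logical prompt batch
# --ubatch-size  n_ubatch, physical micro-batch (the VRAM-sensitive knob)
# --n-cpu-moe    keep MoE expert tensors for the first N layers in CPU RAM,
#                so attention and shared weights fit in 16GB of VRAM
llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --ctx-size 65536 \
  --parallel 1 \
  --kv-unified \
  --batch-size 2048 \
  --ubatch-size 512 \
  --n-gpu-layers 99 \
  --n-cpu-moe 24
```

The usual tuning loop is to raise `--n-cpu-moe` until the model loads without out-of-memory errors, then shrink it (and grow `--ubatch-size`) to reclaim speed.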
// TAGS
qwen3.5-35b-a3b · llama.cpp · llm · inference · gpu · open-weights · self-hosted · benchmark
DISCOVERED
4h ago
2026-04-16
PUBLISHED
1d ago
2026-04-15
RELEVANCE
8/10
AUTHOR
Nutty_Praline404