Qwen3.5-35B-A3B hits 40-60 tok/s on RTX 4060 Ti 16GB
A LocalLLaMA user reports tuning llama.cpp to run Unsloth’s Qwen3.5-35B-A3B GGUF at 64k context on an RTX 4060 Ti 16GB, with real-world throughput around 40-60 tok/s. The post focuses on the config shape that made the difference: KV-unified routing, batch sizing, MoE CPU offload, and keeping VRAM pressure under control.
This is a useful reality check for local LLM inference: the bottleneck is often runtime configuration, not just GPU size.
- The win appears to come from the effective runtime shape, not a single magic flag, which is why the author emphasizes `n_parallel`, `kv_unified`, `n_ctx_seq`, `n_ctx_slot`, `n_batch`, and `n_ubatch`.
- The reported numbers come from actual continuation and long-context runs, so they are more credible than a one-off prompt benchmark.
- A 16GB consumer card handling a 35B-class MoE model at 64k context makes local deployment feel practical for serious dev workflows, not just hobby demos.
- The post also points to a gap in the ecosystem: people need shared, hardware-specific config baselines instead of rediscovering the same tuning tricks.
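To make the "runtime shape" idea concrete, here is a minimal Python sketch of how those settings interact. It is not the author's configuration: every numeric value and the model path are hypothetical placeholders, and the per-sequence context rule (the full `n_ctx` when the KV cache is unified, `n_ctx / n_parallel` otherwise) is an assumption about how llama.cpp derives `n_ctx_seq`. The flags used (`-c`, `-b`, `-ub`, `-np`, `-ngl`, `--n-cpu-moe`, `--kv-unified`) are the ones exposed by recent `llama-server` builds; check `llama-server --help` on your own build before relying on them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RuntimeShape:
    """Hypothetical container for the settings the post highlights."""
    n_ctx: int          # total context size passed via -c
    n_parallel: int     # number of server slots (-np)
    kv_unified: bool    # one shared KV buffer across sequences
    n_batch: int        # logical batch size (-b)
    n_ubatch: int       # physical micro-batch size (-ub)
    n_cpu_moe: int      # MoE expert layers kept on the CPU (--n-cpu-moe)
    n_gpu_layers: int = 99  # layers offloaded to the GPU (-ngl)


def n_ctx_seq(shape: RuntimeShape) -> int:
    """Context available to a single sequence (assumed rule, see lead-in)."""
    if shape.kv_unified:
        return shape.n_ctx
    return shape.n_ctx // shape.n_parallel


def server_args(shape: RuntimeShape, model_path: str) -> List[str]:
    """Build a llama-server argument list from the shape."""
    if shape.n_ubatch > shape.n_batch:
        raise ValueError("n_ubatch should not exceed n_batch")
    args = [
        "llama-server",
        "-m", model_path,
        "-c", str(shape.n_ctx),
        "-np", str(shape.n_parallel),
        "-b", str(shape.n_batch),
        "-ub", str(shape.n_ubatch),
        "-ngl", str(shape.n_gpu_layers),
        "--n-cpu-moe", str(shape.n_cpu_moe),
    ]
    if shape.kv_unified:
        args.append("--kv-unified")
    return args


if __name__ == "__main__":
    # Placeholder values for illustration only, not the reported config.
    split = RuntimeShape(65536, 4, False, 2048, 512, 20)
    unified = RuntimeShape(65536, 4, True, 2048, 512, 20)
    print(n_ctx_seq(split), n_ctx_seq(unified))
    print(" ".join(server_args(unified, "model.gguf")))
```

Under that assumed rule, the same `-c 65536 -np 4` request leaves each sequence 16384 tokens without a unified KV cache and the full 65536 with one. That is one way a single requested context size can produce very different effective shapes.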
DISCOVERED
2026-04-16
PUBLISHED
2026-04-15
AUTHOR
Nutty_Praline404