ik_llama.cpp tops 110 tok/s on RTX 4070 Super
A Reddit user reports a substantial local inference speedup on an RTX 4070 Super 12GB by switching from upstream llama.cpp to ik_llama.cpp for Qwen3.6-35B-A3B-IQ4_XS MTP workloads. Using the same benchmark script and broadly similar settings, they say throughput rose from about 89.8 tok/s in llama.cpp to 110.2 tok/s in ik_llama.cpp, and they shared the exact launch flags they used to fit the model into 12GB VRAM.
The interesting part here is not just the raw number, but that the win comes from runtime/launcher differences rather than a different model. For people squeezing 30B+ MoE models onto consumer GPUs, that is the kind of optimization that actually changes what is usable.
- Strong practical result: same class of quant, same hardware, roughly 23% more throughput (89.8 to 110.2 tok/s).
- The post is more of a reproducible tuning note than a model breakthrough, which makes it useful for local LLM operators.
- The benchmark also shows a lower acceptance rate in ik_llama.cpp, so the speedup is not a free lunch and may reflect a different draft/accept tradeoff.
- Hardware setup matters a lot here: using a secondary GPU and routing display output through the iGPU to maximize available VRAM is part of the recipe.
- Best fit for readers running high-context local models on 12GB cards, especially if they are already using MTP.
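The headline gain can be checked directly from the two reported throughput figures. The snippet below is a minimal sketch of that arithmetic; the function and variable names are illustrative, not taken from the original post or from either project.

```python
def relative_speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Return the throughput gain as a fraction of the baseline."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (new_tok_s - baseline_tok_s) / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Convert tokens per second into average milliseconds per token."""
    if tok_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tok_s


# Figures reported in the post (tokens per second).
llama_cpp_tok_s = 89.8
ik_llama_tok_s = 110.2

gain = relative_speedup(llama_cpp_tok_s, ik_llama_tok_s)
print(f"speedup: {gain:.1%}")  # speedup: 22.7%
print(f"llama.cpp:    {ms_per_token(llama_cpp_tok_s):.2f} ms/token")
print(f"ik_llama.cpp: {ms_per_token(ik_llama_tok_s):.2f} ms/token")
```

Because the acceptance rate also differs between the two runs, this figure describes end-to-end throughput only and says nothing about per-step draft quality.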
DISCOVERED: 2026-05-21
PUBLISHED: 2026-05-21
AUTHOR: janvitos