ik_llama.cpp tops 110 tok/s on RTX 4070 Super

// 45d ago · TUTORIAL

A Reddit user reports a substantial local inference speedup on an RTX 4070 Super 12GB by switching from upstream llama.cpp to ik_llama.cpp for Qwen3.6-35B-A3B-IQ4_XS MTP workloads. Using the same benchmark script and broadly similar settings, they say throughput rose from about 89.8 tok/s in llama.cpp to 110.2 tok/s in ik_llama.cpp, and they shared the exact launch flags they used to fit the model into 12GB VRAM.
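The size of the gain follows directly from the two reported throughput figures. A minimal sketch of that arithmetic is below; the function and variable names are illustrative and do not come from the original post or its benchmark script.

```python
def relative_speedup(baseline_tps: float, new_tps: float) -> float:
    """Return the fractional throughput gain of new_tps over baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tps / baseline_tps - 1.0


def ms_per_token(tps: float) -> float:
    """Convert tokens per second into average milliseconds per token."""
    return 1000.0 / tps


# Throughput figures as reported in the post (tokens per second).
llama_cpp_tps = 89.8
ik_llama_cpp_tps = 110.2

gain = relative_speedup(llama_cpp_tps, ik_llama_cpp_tps)
saved_ms = ms_per_token(llama_cpp_tps) - ms_per_token(ik_llama_cpp_tps)

print(f"speedup: {gain:.1%}")  # prints "speedup: 22.7%"
print(f"average time saved per token: {saved_ms:.2f} ms")
```

Expressing the result as time per token as well as a percentage makes it easier to compare against other single-run benchmarks, which can vary by a few percent between runs.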

// ANALYSIS

The interesting part here is not just the raw number, but that the win comes from runtime/launcher differences rather than a different model. For people squeezing 30B+ MoE models onto consumer GPUs, that is the kind of optimization that actually changes what is usable.

–Strong practical result: same class of quant, same hardware, roughly 23% more throughput (89.8 to 110.2 tok/s).
–The post is more of a reproducible tuning note than a model breakthrough, which makes it useful for local LLM operators.
–The benchmark also shows lower acceptance rate in ik_llama.cpp, so the speedup is not a free lunch and may reflect a different draft/accept tradeoff.
–Hardware setup matters a lot here: part of the recipe is using a secondary GPU and driving the display from the iGPU to maximize available VRAM.
–Best fit for readers running high-context local models on 12GB cards, especially if they are already using MTP.
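The point about a lower acceptance rate coexisting with higher throughput can be illustrated with a toy model. The sketch below assumes each drafted token is accepted independently with a fixed probability, which gives the standard expected-tokens-per-pass formula used in speculative decoding analyses. Every number in it is hypothetical and chosen only for illustration: none are measurements from the post, and the model is not a description of how either runtime implements MTP.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of draft_len drafted tokens is accepted independently with
    probability accept_prob, and that one extra token is always produced by
    the verifying model. This is a simplification, not a measured behavior.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def throughput_tps(accept_prob: float, draft_len: int, pass_ms: float) -> float:
    """Tokens per second given a fixed time per draft-and-verify pass."""
    return expected_tokens_per_pass(accept_prob, draft_len) / (pass_ms / 1000.0)


# Hypothetical runtimes: B accepts fewer drafted tokens but finishes each
# pass faster. These values are invented for the example.
runtime_a = throughput_tps(accept_prob=0.80, draft_len=2, pass_ms=25.0)
runtime_b = throughput_tps(accept_prob=0.70, draft_len=2, pass_ms=18.0)

print(f"runtime A: {runtime_a:.1f} tok/s")
print(f"runtime B: {runtime_b:.1f} tok/s")
```

Under these made-up inputs, the runtime with the lower acceptance rate still comes out ahead because its per-pass cost is smaller. That is one plausible reading of the reported numbers, not a confirmed explanation.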

// TAGS

ik-llama-cpp · llama-cpp · qwen3-6 · mtp · local-first · quantization · rtx-4070-super · vram · benchmark

DISCOVERED

45d ago

2026-05-21

PUBLISHED

45d ago

2026-05-21

RELEVANCE

9/10

AUTHOR

janvitos
