ik_llama.cpp tops 110 tok/s on RTX 4070 Super

// 1h ago · TUTORIAL

A Reddit user reports a substantial local inference speedup on an RTX 4070 Super 12GB by switching from upstream llama.cpp to ik_llama.cpp for Qwen3.6-35B-A3B-IQ4_XS MTP workloads. Using the same benchmark script and broadly similar settings, they say throughput rose from about 89.8 tok/s in llama.cpp to 110.2 tok/s in ik_llama.cpp, and they shared the exact launch flags they used to fit the model into 12GB VRAM.

// ANALYSIS

The interesting part here is not just the raw number, but that the win comes from runtime/launcher differences rather than a different model. For people squeezing 30B+ MoE models onto consumer GPUs, that is the kind of optimization that actually changes what is usable.

  • Strong practical result: same class of quant, same hardware, roughly 23% more throughput (89.8 to 110.2 tok/s).
  • The post is more of a reproducible tuning note than a model breakthrough, which makes it useful for local LLM operators.
  • The benchmark also shows lower acceptance rate in ik_llama.cpp, so the speedup is not a free lunch and may reflect a different draft/accept tradeoff.
  • Hardware setup matters a lot here: secondary GPU usage plus iGPU display output to maximize available VRAM is part of the recipe.
  • Best fit for readers running high-context local models on 12GB cards, especially if they are already using MTP.
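
The throughput gain can be checked directly from the two figures reported in the post. The snippet below is a minimal sketch of that arithmetic, not the author's benchmark script; the function names and the 1000-token example are illustrative assumptions.

```python
def relative_speedup(baseline_tps: float, new_tps: float) -> float:
    """Fractional throughput gain of new_tps over baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps - 1.0


def seconds_for_tokens(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady `tps`."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


# Figures reported in the post (tokens per second).
LLAMA_CPP_TPS = 89.8
IK_LLAMA_CPP_TPS = 110.2

gain = relative_speedup(LLAMA_CPP_TPS, IK_LLAMA_CPP_TPS)
print(f"throughput gain: {gain:.1%}")  # throughput gain: 22.7%

# Hypothetical 1000-token reply, to show what the gain means in wall-clock time.
for name, tps in [("llama.cpp", LLAMA_CPP_TPS), ("ik_llama.cpp", IK_LLAMA_CPP_TPS)]:
    print(f"{name}: {seconds_for_tokens(1000, tps):.2f} s per 1000 tokens")
```

This measures end-to-end generated tokens per second only. It says nothing about the MTP acceptance rate, which the post reports as lower in ik_llama.cpp and which would need to be compared separately.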
// TAGS
ik-llama-cpp, llama-cpp, qwen3-6, mtp, local-first, quantization, rtx-4070-super, vram, benchmark

DISCOVERED: 2026-05-21 (1h ago)
PUBLISHED: 2026-05-21 (2h ago)
RELEVANCE: 9/10
AUTHOR: janvitos