YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

Qwen3.6-35B-A3B tops 80 tok/sec in llama.cpp

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+ TRACKED FEEDS

24/7 SCRAPED FEED

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

// 2h ago · BENCHMARK RESULT

Qwen3.6-35B-A3B tops 80 tok/sec in llama.cpp

A Reddit guide shows Qwen3.6-35B-A3B running through llama.cpp with MTP (multi-token prediction) on a 12GB RTX 4070 Super and clearing 80 tok/sec in the author's benchmark. The trick is careful CPU/GPU balancing plus KV-cache quantization, which keeps both throughput and the full 128K context within reach.
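
For concreteness, here is a minimal sketch of the kind of launch the post describes, written as a Python wrapper around llama-server. It is an illustration under stated assumptions, not the author's command: the GGUF filename and quant level are placeholders, the expert-offload regex is a common MoE trick rather than anything quoted from the guide, and the post's `-fitt` flag and MTP settings are left out because their exact spelling can't be confirmed here.

```python
import subprocess

# Hypothetical reconstruction of a CPU/GPU-balanced llama.cpp launch with a
# quantized KV cache. The filename and the offload regex are assumptions.
cmd = [
    "llama-server",
    "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # placeholder model file and quant
    "-c", "131072",                        # the 128K context the post targets
    "-ngl", "99",                          # offload all layers that fit to VRAM...
    "-ot", r"\.ffn_.*_exps\.=CPU",         # ...but keep MoE expert tensors in system RAM
    "--cache-type-k", "q8_0",              # quantize the K cache
    "--cache-type-v", "q8_0",              # quantize the V cache (needs flash attention)
]
subprocess.run(cmd, check=True)
```

The `-ot` pattern is the usual move for MoE on small VRAM: attention and shared weights stay on the GPU, while the expert tensors, which dominate the parameter count but are only sparsely activated per token, live in system RAM.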

// ANALYSIS

This is a strong local-inference result, but the real story is memory choreography rather than magic hardware. It shows how far sparse MoE plus speculative decoding can stretch a “too big for 12GB” model when the runtime is tuned hard.

  • `-fitt 1664` is doing the heavy lifting by reserving enough VRAM for the draft model and KV cache while letting llama.cpp spill the rest intelligently.
  • The posted 70-82 tok/s range is respectable, but with speculative decoding the draft-token acceptance rate matters as much as raw draft speed (a simple model is sketched after this list).
  • 128K context on 12GB is the more meaningful achievement here; many local setups are fast only while the prompt stays short (the arithmetic below shows the scale involved).
  • This is not a universal 12GB recipe, especially if the GPU is also driving a display, so real-world headroom will vary.
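
To put rough numbers on those two points, a back-of-envelope sketch. Every architecture figure in it is an assumption (chosen to be roughly in line with the smaller Qwen3 MoE family), not a confirmed Qwen3.6-35B-A3B spec, and the acceptance model is the standard i.i.d. simplification rather than anything measured in the post.

```python
# KV-cache sizing and speculative-decoding arithmetic.
# All architecture numbers are ASSUMPTIONS for illustration;
# Qwen3.6-35B-A3B's real dimensions may differ.
n_layers, n_kv_heads, head_dim = 48, 4, 128
ctx = 131072  # 128K tokens

def kv_bytes(bytes_per_elem: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

print(f"f16 KV @128K : {kv_bytes(2.0) / 2**30:.1f} GiB")      # ~12.0 GiB
print(f"q8_0 KV @128K: {kv_bytes(34 / 32) / 2**30:.1f} GiB")  # ~6.4 GiB (32-elem blocks + fp16 scale)

def expected_tokens_per_step(k: int, p: float) -> float:
    # Expected tokens emitted per verification step with draft length k and
    # per-token acceptance rate p (i.i.d. model; real acceptance varies by prompt).
    return (1 - p ** (k + 1)) / (1 - p)

print(f"k=4, p=0.7 -> {expected_tokens_per_step(4, 0.7):.2f} tokens/step")
```

Under these assumed dims, a full-precision 128K cache alone would fill the card, and even q8_0 takes roughly half of it, which is why the spill strategy matters as much as the quantization; likewise, at a 0.7 acceptance rate a 4-token draft yields only about 2.8 tokens per step, so a fast draft with poor acceptance buys little.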
// TAGS
llm · long-context · moe · quantization · inference · gpu · benchmark · qwen3.6-35b-a3b

DISCOVERED: 2h ago (2026-05-09)

PUBLISHED: 3h ago (2026-05-09)

RELEVANCE: 8/10

AUTHOR: janvitos