Qwen3.6-35B-A3B tops 80 tok/sec in llama.cpp
A Reddit guide shows Qwen3.6-35B-A3B running through llama.cpp's MTP (multi-token prediction) support on a 12GB RTX 4070 Super and clearing 80 tok/sec in the author's benchmark. The trick is careful CPU/GPU balancing plus KV-cache quantization, which keeps both high throughput and 128K context within reach.
This is a strong local-inference result, but the real story is memory choreography rather than magic hardware. It shows how far sparse MoE plus speculative decoding can stretch a “too big for 12GB” model when the runtime is tuned hard.
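For a sense of the knobs involved, here is a minimal sketch of the kind of llama-server invocation the post describes. It is not the author's exact command: the model path and quant are placeholders, the `-ot` regex is one common way to pin MoE expert tensors to CPU, flag syntax varies across llama.cpp builds, and `-fitt 1664` is reproduced verbatim from the post.

```sh
# Sketch only -- placeholders throughout, not the author's exact command.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -ot "ffn_.*_exps.=CPU" \
  -fitt 1664
# -c 131072       : the 128K context window the post targets
# -fa on          : flash attention, needed for a quantized V cache
# -ctk/-ctv q8_0  : KV-cache quantization, roughly halving cache size vs f16
# -ngl 99 + -ot   : offload all layers, then push the sparse MoE expert
#                   tensors back to system RAM so attention stays on the GPU
# -fitt 1664      : the VRAM-headroom flag quoted in the post
```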
- `-fitt 1664` is doing the heavy lifting by reserving enough VRAM for the draft model and KV cache while letting llama.cpp spill the rest intelligently.
- The posted 70-82 tok/s range is respectable, but the draft-token acceptance rate matters just as much as raw drafting speed.
- 128K context on 12GB is the more meaningful achievement here; many local setups are only fast while the prompt stays short (a rough cache-size estimate follows the list).
- This is not a universal 12GB recipe, especially if the GPU is also driving a display, so real-world headroom will vary.
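To see why 128K is the hard part, a back-of-envelope KV-cache estimate helps. The dimensions below are illustrative placeholders (the post does not state the model's layer or head counts); the point is the scaling, not the exact figure:

\[
\underbrace{2}_{K+V} \times \underbrace{48}_{\text{layers}} \times \underbrace{4}_{\text{kv heads}} \times \underbrace{128}_{d_{\text{head}}} \times \underbrace{131072}_{\text{ctx}} \times 2\,\text{B} \approx 12\,\text{GiB at f16}
\]

Quantizing the cache to q8_0 (about 8.5 bits per element) cuts that to roughly 6.4 GiB, which is the difference between impossible on a 12GB card and tight-but-feasible once the expert weights are spilled to system RAM.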
Discovered: 2026-05-09 (2h ago)
Published: 2026-05-09 (3h ago)
Author: janvitos
