Qwen3.6-35B-A3B hits 79 t/s, 128K ctx

// BENCHMARK RESULT · 90d ago

A Reddit benchmark shows Qwen3.6-35B-A3B generating 78.7-79.3 tokens/sec on an RTX 5070 Ti when llama.cpp is run with `--n-cpu-moe` instead of the blanket `--cpu-moe`. The poster also claims 128K context is practical with `-np 1` and a q8 KV cache.

// ANALYSIS

This reads less like a model breakthrough and more like a reminder that MoE placement policy can make or break local inference. On 16GB GPUs, the difference between keeping experts pinned to CPU versus splitting them intelligently is the difference between wasting VRAM and getting a genuinely usable speedup.

– `--cpu-moe` keeps every expert on the CPU and leaves most of the GPU underutilized; `--n-cpu-moe 20` keeps only the first 20 layers' experts on the CPU and moves the rest into VRAM, lifting generation from 51.2 t/s to 78.7 t/s (roughly 1.5x).
– `-np 1` matters for single-user serving because it avoids wasting memory on extra recurrent-state slots, which helps make the 128K context claim believable in a constrained setup.
– The result is hardware-specific, but the pattern is useful: for sparse MoE models, inference tuning can matter as much as quantization choice.
– The 9800X3D’s large L3 cache probably helps the CPU side, so this is not a universal 16GB-GPU recipe, just a strong data point for consumer desktop rigs.
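
To make the tuning concrete, here is a minimal Python sketch that assembles a `llama-server` command line from the settings discussed above and computes the reported speedup. It is an illustration, not the original poster's exact command: the model filename is hypothetical, `-ngl 99` (offload all layers before the MoE override applies) and a context size of 131072 tokens for "128K" are assumptions, and depending on the llama.cpp build a quantized V cache may also require flash attention to be enabled.

```python
import shlex


def build_llama_server_args(model_path, n_cpu_moe, ctx_size=131072,
                            parallel=1, kv_type="q8_0"):
    """Assemble a llama-server argv list for a partially offloaded MoE model.

    n_cpu_moe: number of leading layers whose expert weights stay on the CPU.
    ctx_size:  assumed token count for "128K" context (131072).
    parallel:  number of server slots; 1 for single-user serving.
    kv_type:   cache type used for both K and V (assumed q8_0 for "q8").
    """
    return [
        "llama-server",
        "-m", model_path,               # hypothetical GGUF path
        "-ngl", "99",                   # assumption: offload all layers to the GPU first
        "--n-cpu-moe", str(n_cpu_moe),  # then keep the first N layers' experts on CPU
        "-c", str(ctx_size),
        "-np", str(parallel),
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
    ]


def speedup(baseline_tps, tuned_tps):
    """Ratio of tuned to baseline generation speed."""
    return tuned_tps / baseline_tps


args = build_llama_server_args("qwen3.6-35b-a3b.gguf", n_cpu_moe=20)
print(shlex.join(args))
print(f"speedup: {speedup(51.2, 78.7):.2f}x")
```

The value 20 comes from the benchmark; on other hardware the usual approach would be to lower `n_cpu_moe` until VRAM runs out, then back off.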

// TAGS

qwen3-6-35b-a3b, llm, gpu, benchmark, inference, open-source

DISCOVERED

90d ago

2026-04-18

PUBLISHED

90d ago

2026-04-18

RELEVANCE

10/10

AUTHOR

marlang
