Qwen3.6-35B-A3B hits ~190K context on 8GB VRAM
This post shares a practical local-inference setup for running Qwen3.6-35B-A3B with roughly 190K context on an RTX 4060 8GB laptop with 32GB of DDR5 RAM, running as a Linux server reached over Tailscale. The author reports strong throughput on Q5 GGUF builds, with performance improving further after tuning `ctx-size`, `n-gpu-layers`, `n-cpu-moe`, and TurboQuant KV cache settings in a custom llama.cpp fork.
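For orientation, here is a minimal sketch of how those knobs map onto a stock `llama-server` invocation, driven from Python for readability. The GGUF filename, the layer/expert split, and the use of `--cache-type-k`/`--cache-type-v` are assumptions, not the author's published configuration; the post itself relies on the fork's TurboQuant cache rather than mainline flags.

```python
import subprocess

# Illustrative sketch only: the model filename, flag values, and offload split
# are hypothetical placeholders, not the author's published configuration.
cmd = [
    "./llama-server",
    "--model", "Qwen3.6-35B-A3B-Q5_K_M.gguf",  # placeholder path to a Q5 GGUF build
    "--ctx-size", "190000",                    # target roughly 190K tokens of context
    "--n-gpu-layers", "999",                   # offload as many layers as fit in 8GB VRAM
    "--n-cpu-moe", "24",                       # keep this many MoE expert blocks in DDR5 (tune per machine)
    "--cache-type-k", "q8_0",                  # stock KV-cache quantization flags; the post's
    "--cache-type-v", "q8_0",                  #   TurboQuant fork replaces this mechanism
]
subprocess.run(cmd, check=True)
```

Note that in mainline builds a quantized V cache also requires flash attention to be enabled, and the right `--n-cpu-moe` value depends on how much of the expert weights can sit in DDR5 without starving the GPU.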
Hot take: this is less a model announcement than a useful real-world stress test showing how far sparse MoE inference can be pushed on consumer hardware when the memory layout is tuned carefully.
- The main value is the configuration recipe, not just the raw benchmark numbers, because it shows what actually moved throughput at very large context.
- The post suggests Q5 materially outperforms Q4 for long-context reasoning on this model family, which is a useful signal for anyone optimizing quality vs. speed.
- TurboQuant KV cache appears to be the key enabler at ~190K context, making the setup far more viable than standard cache behavior; the sizing sketch after this list shows why the cache dominates at that length.
- The Linux + DDR5 emphasis is believable and practical: bandwidth, paging behavior, and mmap/mlock choices likely matter more than people expect.
- `n-cpu-moe` tuning is the most interesting knob here, but the author is still in the exploratory phase rather than presenting a universally optimal value.
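To see why the KV cache is the pivotal constraint at this scale, the sketch below estimates cache size from context length. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6-35B-A3B's actual architecture, and the byte widths are round numbers, so only the order of magnitude matters.

```python
# Back-of-envelope KV-cache sizing. All architecture numbers below are
# hypothetical placeholders, not the real Qwen3.6-35B-A3B configuration.
def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold ctx_tokens * n_kv_heads * head_dim elements per layer.
    elems = 2 * ctx_tokens * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 2**30

CTX = 190_000
print(f"fp16 cache:   {kv_cache_gib(CTX, 48, 4, 128, 2.0):.1f} GiB")  # ~17 GiB, far beyond 8GB VRAM
print(f"~4-bit cache: {kv_cache_gib(CTX, 48, 4, 128, 0.5):.1f} GiB")  # ~4 GiB, fits alongside weights
```

Even with placeholder numbers, the gap makes the bullet's claim plausible: at ~190K tokens an unquantized cache alone would overflow the card's 8GB, so aggressive cache quantization is what makes the rest of the recipe workable.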
Discovered: 2026-05-10 (3h ago) · Published: 2026-05-10 (5h ago) · Author: Atul_Kumar_97