Qwen3.6-35B-A3B gets long-context tuning tips
OPEN_SOURCE ↗
REDDIT // 4h ago · TUTORIAL


Reddit users are benchmarking Qwen3.6-35B-A3B locally with llama.cpp, running vision support, a 90K-token context, and aggressive GPU offload on an 8GB-VRAM card backed by 24GB of system RAM. The discussion centers on whether the slowdown comes from model size, the long context window, or suboptimal inference flags.
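As a rough starting point, a llama.cpp launch along these lines exercises the setup the thread describes. The GGUF filenames, the `-ngl` layer split, and the thread count are illustrative guesses, not the poster's actual flags; the flags themselves (`-m`, `--mmproj`, `-c`, `-ngl`, `--cache-type-k`, `--threads`) are standard llama.cpp options:

```shell
# Sketch of a llama-server launch for the setup in the thread.
# Hypothetical pieces: both GGUF filenames, the -ngl split, the thread count.
#   -c 90000            : the 90K context window from the post
#   -ngl 20             : offload only as many layers as 8GB VRAM allows
#   --cache-type-k q8_0 : quantize the K cache to relieve memory pressure
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  -c 90000 \
  -ngl 20 \
  --cache-type-k q8_0 \
  --threads 8
```

Quantizing the V cache as well (`--cache-type-v`) is also possible in llama.cpp but requires flash attention to be enabled, which varies by build and hardware.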

// ANALYSIS

Qwen3.6-35B-A3B shows the usual MoE promise and long-context pain at once: active compute is small, but memory and attention costs still bite hard once you push 90K tokens on consumer hardware.

  • The model’s appeal is clear: 35B total parameters with only 3B active makes it attractive for local multimodal use.
  • The observed throughput drop over time points to KV-cache pressure and context growth, not just raw parameter count.
  • Vision support via `mmproj-F16` makes this a practical local multimodal stack, but that also increases memory pressure on a tight 8GB GPU budget.
  • The post is really about inference discipline: too many flags can hide the real bottleneck and make tuning harder than the model itself.
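The KV-cache point above can be made concrete with back-of-the-envelope arithmetic. The layer count and GQA head dimensions below are hypothetical stand-ins (the model's actual config may differ), but they show why 90K tokens alone can swamp an 8GB card:

```shell
# Rough f16 KV-cache sizing for a 90K context.
# Dims are assumed for illustration: 48 layers, 4 KV heads (GQA), head dim 128.
awk 'BEGIN {
  layers = 48; kv_heads = 4; head_dim = 128   # hypothetical model dims
  ctx = 90000; bytes = 2                      # 90K tokens, f16 = 2 bytes/value
  # K and V each store layers * kv_heads * head_dim values per token
  gib = 2 * layers * kv_heads * head_dim * ctx * bytes / (1024 ^ 3)
  printf "KV cache @ f16: %.1f GiB\n", gib
}'
```

Under these assumed dims, the cache alone would exceed the 8GB of VRAM before any weights are loaded, which is why the offload split and cache-quantization flags dominate the tuning discussion.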
// TAGS
qwen3-6-35b-a3b · llm · inference · gpu · multimodal · llama.cpp

DISCOVERED

4h ago

2026-04-19

PUBLISHED

7h ago

2026-04-19

RELEVANCE

8/10

AUTHOR

FUS3N