OPEN_SOURCE ↗
REDDIT // 4h ago · TUTORIAL
Qwen3.6-35B-A3B gets long-context tuning tips
Reddit users are benchmarking Qwen3.6-35B-A3B locally with llama.cpp, including vision support, 90K context, and aggressive GPU offload on an 8GB VRAM card plus 24GB RAM. The discussion centers on whether the slowdown comes from the model size, the long context window, or suboptimal inference flags.
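The setup described in the thread can be sketched as a llama.cpp server invocation. The model and projector file names, layer count, and quantization level below are placeholders, not the poster's exact flags:

```shell
# Hypothetical llama.cpp launch mirroring the thread's setup:
# 8GB VRAM GPU + 24GB RAM, 90K context, vision via an mmproj file.
# File names and -ngl value are illustrative, not the poster's exact config.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --ctx-size 90000 \
  --n-gpu-layers 24 \
  --cache-type-k q8_0
```

Tuning usually means varying `--n-gpu-layers` until VRAM is nearly full, then watching whether throughput degrades as the context fills; a quantized K cache (`--cache-type-k q8_0`) trades a little quality for meaningfully less KV memory at long contexts.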
// ANALYSIS
Qwen3.6-35B-A3B is showing the usual MoE promise and long-context pain at the same time: it is small in active compute, but the memory and attention costs still bite hard once you push 90K tokens on consumer hardware.
- The model’s appeal is clear: 35B total parameters with only 3B active makes it attractive for local multimodal use.
- The observed throughput drop over time points to KV-cache pressure and context growth, not just raw parameter count.
- Vision support via `mmproj-F16` makes this a practical local multimodal stack, but it also increases memory pressure on a tight 8GB GPU budget.
- The post is really about inference discipline: too many flags can hide the real bottleneck and make tuning harder than the model itself.
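The KV-cache pressure mentioned above can be made concrete with a back-of-the-envelope estimate. The architecture numbers below (layer count, KV heads, head dimension) are illustrative placeholders, not confirmed values for Qwen3.6-35B-A3B:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elt: int = 2) -> int:
    """Rough KV-cache size: 2 tensors (K and V) per layer, one
    (n_kv_heads * head_dim) vector per token, bytes_per_elt=2 for f16."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * ctx_len

# Hypothetical MoE architecture numbers, for illustration only.
gib = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128,
                     ctx_len=90_000) / 2**30
print(f"{gib:.1f} GiB")  # several GiB of KV cache at 90K context
```

Even with grouped-query attention keeping the per-token footprint small, a 90K-token window alone can rival the 8GB VRAM budget, which is why quantized KV caches and partial GPU offload dominate the discussion rather than the 3B active parameter count.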
// TAGS
qwen3-6-35b-a3b · llm · inference · gpu · multimodal · llama.cpp
DISCOVERED
4h ago
2026-04-19
PUBLISHED
7h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
FUS3N