Qwen3.6 35B hits 564/41 on 5070 Ti
This Reddit post shows Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive running as a GGUF Q4_K_M quant on an RTX 5070 Ti with a 262K context window. The poster reports token speeds of 564/41 (presumably prompt-processing and generation tokens per second, in that order) and shares the llama.cpp flags needed to keep the model usable with 16GB of VRAM plus heavy spillover into system RAM.
The setup leans on llama.cpp flags such as --n-cpu-moe, --cache-type-k q4_0, and --cache-type-v q4_0 to make a 35B MoE model fit: the first keeps the MoE expert weights of some layers on the CPU, and the other two quantize the KV cache to 4-bit. Memory is the binding constraint: the post reports 10.8/16GB of VRAM in use plus shared memory and system RAM under pressure, so this is viable only on a well-provisioned, carefully managed workstation. A 262K context window is impressive, but it makes the performance figures highly configuration-dependent rather than broadly transferable. The TurboQuants miss is a useful warning sign that local LLM tuning still has rough edges even when the base model runs well.
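To see why the KV cache type matters at this context length, here is a rough back-of-the-envelope sketch. It is not taken from the post. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not the confirmed configuration of this model, and a hybrid-attention model may cache fewer layers than assumed here. The per-element sizes follow the standard GGML block layouts (q4_0 packs 32 values into 18 bytes, q8_0 into 34 bytes).

```python
# Approximate bytes per cached element for common llama.cpp KV cache types.
# q4_0: 32 values per block = 16 bytes of 4-bit values + 2-byte fp16 scale.
# q8_0: 32 values per block = 32 bytes of 8-bit values + 2-byte fp16 scale.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size for a plain attention stack (no hybrid layers)."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim  # per K or per V
    k_bytes = elements * BYTES_PER_ELEMENT[type_k]
    v_bytes = elements * BYTES_PER_ELEMENT[type_v]
    return k_bytes + v_bytes


# ASSUMPTION: placeholder architecture values, not read from the actual GGUF.
HYPOTHETICAL_ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)
N_CTX = 262_144  # the 262K context window mentioned in the post

f16_size = kv_cache_bytes(N_CTX, **HYPOTHETICAL_ARCH)
q4_size = kv_cache_bytes(N_CTX, **HYPOTHETICAL_ARCH,
                         type_k="q4_0", type_v="q4_0")

GIB = 2 ** 30
print(f"f16 KV cache:  {f16_size / GIB:.2f} GiB")
print(f"q4_0 KV cache: {q4_size / GIB:.2f} GiB")
print(f"reduction:     {f16_size / q4_size:.2f}x")
```

Whatever the real architecture numbers are, the ratio between f16 and q4_0 stays at roughly 3.6x, which is the difference between a cache that cannot fit next to the weights and one that can.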
DISCOVERED: 2026-05-11
PUBLISHED: 2026-05-11
AUTHOR: KptEmreU