Qwen 3.6 reaches 37 t/s on 3060
An optimized stack using spiritbuun’s llama-cpp fork and mudler’s APEX quantization enables Qwen 3.6 35B to generate at 37 tokens/sec on a single 12GB RTX 3060. The setup pushes consumer hardware limits with 128K context support and perfect needle-in-a-haystack retrieval.
VRAM capacity is no longer a hard ceiling for large model inference on consumer hardware when paired with optimized compute kernels.
- Spiritbuun's CUDA enhancements, including fused MMA and TurboQuant, allow a 17.3GB model to be partially offloaded from a 12GB card with minimal speed penalty.
- Mudler’s APEX I-Compact quantization clearly outperforms standard quantizations such as those from Unsloth or Bartowski.
- The -fitt 1500 flag provides a critical workaround for mmproj memory management, preventing OOMs during multimodal offloading.
- Multi-Token Prediction (MTP) is shown to hurt performance in memory-constrained offloading scenarios, which underscores the need for raw compute optimization.
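To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers stated above (17.3GB model, 12GB card, 37 tokens/sec). The helper names and the `reserved_gb` parameter are hypothetical, introduced purely for illustration; they do not correspond to any flag or API in the llama-cpp fork.

```python
# Figures quoted in the summary; GB is treated as one consistent unit throughout.
MODEL_SIZE_GB = 17.3
VRAM_GB = 12.0
TOKENS_PER_SEC = 37.0


def min_spill_gb(model_gb: float, vram_gb: float, reserved_gb: float = 0.0) -> float:
    """Lower bound on the weights that cannot reside in VRAM.

    reserved_gb is a hypothetical allowance for everything else that
    competes for VRAM (KV cache, compute buffers, mmproj, desktop use).
    """
    usable_gb = vram_gb - reserved_gb
    return max(0.0, model_gb - usable_gb)


def ms_per_token(tokens_per_sec: float) -> float:
    """Average wall-clock time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_sec


spill = min_spill_gb(MODEL_SIZE_GB, VRAM_GB)
print(f"at least {spill:.1f} GB ({spill / MODEL_SIZE_GB:.0%}) of the weights stay off-GPU")
print(f"{ms_per_token(TOKENS_PER_SEC):.1f} ms per generated token on average")
```

Even with nothing else in VRAM, at least 5.3GB of the weights (roughly a third of the model) must live in system memory, and each token still arrives in about 27 ms on average. Any non-zero `reserved_gb` only increases the share held off-GPU.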
DISCOVERED
2026-05-28
PUBLISHED
2026-05-28
AUTHOR
old-mike