OPEN_SOURCE ↗
REDDIT // 6h ago · BENCHMARK RESULT
Qwen3.6, Gemma 4 run fast on 12GB
A 4070 Super 12GB user reports surprisingly strong local throughput on Qwen3.6 and Gemma 4 quants using Unsloth GGUFs and heavily tuned llama.cpp configs. The setup shows that 12GB VRAM can still handle serious coding-focused models if you lean on quantization, offload, and aggressive batching.
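As a rough sanity check on why quantization and offload carry so much of the load here, the sketch below estimates weight-only footprints from parameter count and bits-per-weight. The bpw figures are typical for the named quant families, not measurements of the exact GGUF files from the post, and a real deployment also needs room for KV cache and buffers, so treat the verdicts as illustrative.

```python
def gguf_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint of a quantized model, in GB."""
    # billions of params * bits per weight / 8 bits per byte
    return params_b * bits_per_weight / 8

VRAM_GB = 12.0  # RTX 4070 Super

# Parameter counts from the post; bpw values are typical for each quant family.
configs = [
    ("Qwen3.6-35B-A3B, ~4.5 bpw (Q4-class)", 35, 4.5),
    ("Gemma 4 26B, 8.5 bpw (Q8_0)", 26, 8.5),
    ("Gemma 4 31B, ~3.5 bpw (IQ3-class)", 31, 3.5),
]

for name, params_b, bpw in configs:
    size = gguf_weight_gb(params_b, bpw)
    verdict = "fits in VRAM" if size < VRAM_GB else "needs CPU/RAM offload"
    print(f"{name}: ~{size:.1f} GB of weights -> {verdict} on a {VRAM_GB:.0f} GB card")
```

None of these land under 12 GB on weights alone, which is why the reported config leans so heavily on MoE offload to system RAM and cache tuning rather than trying to fit everything on the card.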
// ANALYSIS
This is less a model shootout than a reminder that local inference is now mostly a systems-tuning problem. The raw numbers are impressive, but they also show how much performance depends on cache settings, MoE offload, speculative decoding, and whether you can spare system RAM.
- Qwen3.6-35B-A3B is the standout here, with the post claiming about 40 tokens/s generation and roughly 2,100 tokens/s prompt processing on a 12GB card.
- Gemma 4 26B at Q8_0 is close behind and looks like the best balance of speed and size in this setup, while the 31B IQ3 quant falls off sharply.
- The config details matter as much as the models: `n-cpu-moe`, `flash-attn`, `reasoning`, `preserve_thinking`, and `ngram-mod` speculative decoding are doing real work (see the launcher sketch after this list).
- This is a useful signal for local AI coding agents in VS Code, but it is still a single-user benchmark, not a general quality verdict on the models themselves.
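For anyone trying to reproduce this kind of tuning, here is a minimal launcher sketch. The flags shown (`--n-gpu-layers`, `--n-cpu-moe`, `--flash-attn`, quantized KV cache, context size) exist in recent llama.cpp builds, but spellings and accepted values shift between versions, and the model filename, layer split, and context length are placeholders rather than the poster's actual settings; the reasoning, thinking-preservation, and ngram speculative-decoding options cited above are omitted because their exact names depend on the build.

```python
import subprocess

# Illustrative llama-server invocation; paths and numeric values are placeholders,
# not the settings from the Reddit post.
cmd = [
    "./llama-server",
    "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # Unsloth GGUF quant (filename hypothetical)
    "--n-gpu-layers", "99",                # offload every layer that fits onto the GPU
    "--n-cpu-moe", "24",                   # keep this many MoE expert blocks in system RAM
    "--flash-attn",                        # flash attention (newer builds may expect on/off/auto)
    "--cache-type-k", "q8_0",              # quantize the KV cache to stretch 12 GB of VRAM
    "--cache-type-v", "q8_0",
    "--ctx-size", "16384",                 # context length; shrink if VRAM runs out
]
subprocess.run(cmd, check=True)
```

In practice the knob you iterate on is the split between `--n-gpu-layers` and `--n-cpu-moe`: push as much as possible onto the GPU, spill only expert weights to system RAM, then grow the context until VRAM runs out.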
// TAGS
qwen3.6-35b-a3b · gemma-4 · llama-cpp · unsloth · llm · ai-coding · gpu · benchmark
DISCOVERED
2026-04-30
PUBLISHED
2026-04-30
RELEVANCE
8/10
AUTHOR
mr_Owner