OPEN_SOURCE ↗
REDDIT // 6h ago · BENCHMARK RESULT
Qwen3.6, Gemma 4 run fast on 12GB
A 4070 Super 12GB user reports surprisingly strong local throughput on Qwen3.6 and Gemma 4 quants using Unsloth GGUFs and heavily tuned llama.cpp configs. The setup shows that 12GB VRAM can still handle serious coding-focused models if you lean on quantization, offload, and aggressive batching.
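As a rough sanity check on why quantization and offload carry so much of the load here, the sketch below estimates weight-only footprints from parameter count and bits-per-weight. The bpw figures are typical for the named quant families, not measurements of the exact GGUF files from the post, and a real deployment also needs room for KV cache and buffers, so treat the verdicts as illustrative.

```python
def gguf_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint of a quantized model, in GB."""
    # billions of params * bits per weight / 8 bits per byte
    return params_b * bits_per_weight / 8

VRAM_GB = 12.0  # RTX 4070 Super

# Parameter counts from the post; bpw values are typical for each quant family.
configs = [
    ("Qwen3.6-35B-A3B, ~4.5 bpw (Q4-class)", 35, 4.5),
    ("Gemma 4 26B, 8.5 bpw (Q8_0)", 26, 8.5),
    ("Gemma 4 31B, ~3.5 bpw (IQ3-class)", 31, 3.5),
]

for name, params_b, bpw in configs:
    size = gguf_weight_gb(params_b, bpw)
    verdict = "fits in VRAM" if size < VRAM_GB else "needs CPU/RAM offload"
    print(f"{name}: ~{size:.1f} GB of weights -> {verdict} on a {VRAM_GB:.0f} GB card")
```

None of these land under 12 GB on weights alone, which is why the reported config leans so heavily on MoE offload to system RAM and cache tuning rather than trying to fit everything on the card.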
// ANALYSIS
This is less a model shootout than a reminder that local inference is now mostly a systems-tuning problem. The raw numbers are impressive, but they also show how much performance depends on cache settings, MoE offload, speculative decoding, and whether you can spare system RAM.
- Qwen3.6-35B-A3B is the standout here, with the post claiming about 40 tokens/s generation and roughly 2,100 tokens/s prompt processing on a 12GB card.
- Gemma 4 26B at Q8_0 is close behind and looks like the best balance of speed and size in this setup, while the 31B IQ3 quant falls off sharply.
- The config details matter as much as the models: `n-cpu-moe`, `flash-attn`, `reasoning`, `preserve_thinking`, and `ngram-mod` speculative decoding are doing real work (see the launcher sketch after this list).
- This is a useful signal for local AI coding agents in VS Code, but it is still a single-user benchmark, not a general quality verdict on the models themselves.
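For anyone trying to reproduce this kind of tuning, here is a minimal launcher sketch. The flags shown (`--n-gpu-layers`, `--n-cpu-moe`, `--flash-attn`, quantized KV cache, context size) exist in recent llama.cpp builds, but spellings and accepted values shift between versions, and the model filename, layer split, and context length are placeholders rather than the poster's actual settings; the reasoning, thinking-preservation, and ngram speculative-decoding options cited above are omitted because their exact names depend on the build.

```python
import subprocess

# Illustrative llama-server invocation; paths and numeric values are placeholders,
# not the settings from the Reddit post.
cmd = [
    "./llama-server",
    "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # Unsloth GGUF quant (filename hypothetical)
    "--n-gpu-layers", "99",                # offload every layer that fits onto the GPU
    "--n-cpu-moe", "24",                   # keep this many MoE expert blocks in system RAM
    "--flash-attn",                        # flash attention (newer builds may expect on/off/auto)
    "--cache-type-k", "q8_0",              # quantize the KV cache to stretch 12 GB of VRAM
    "--cache-type-v", "q8_0",
    "--ctx-size", "16384",                 # context length; shrink if VRAM runs out
]
subprocess.run(cmd, check=True)
```

In practice the knob you iterate on is the split between `--n-gpu-layers` and `--n-cpu-moe`: push as much as possible onto the GPU, spill only expert weights to system RAM, then grow the context until VRAM runs out.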
// TAGS
qwen3.6-35b-a3b · gemma-4 · llama-cpp · unsloth · llm · ai-coding · gpu · benchmark
DISCOVERED
2026-04-30
PUBLISHED
2026-04-30
RELEVANCE
8/10
AUTHOR
mr_Owner