OPEN_SOURCE
REDDIT · 6h ago · BENCHMARK RESULT

Qwen3.6, Gemma 4 run fast on 12GB

A 4070 Super 12GB user reports surprisingly strong local throughput on Qwen3.6 and Gemma 4 quants using Unsloth GGUFs and heavily tuned llama.cpp configs. The setup shows that 12GB VRAM can still handle serious coding-focused models if you lean on quantization, offload, and aggressive batching.

// ANALYSIS

This is less a model shootout than a reminder that local inference is now mostly a systems-tuning problem. The raw numbers are impressive, but they also show how much performance depends on cache settings, MoE offload, speculative decoding, and whether you can spare system RAM.

  • Qwen3.6-35B-A3B is the standout here, with the post claiming about 40 t/s generation and roughly 2100 t/s prompt processing on a 12GB card.
  • Gemma 4 26B at Q8_0 is close behind and looks like the best balance of speed and size in this setup, while the 31B IQ3 quant falls off sharply.
  • The config details matter as much as the models: `n-cpu-moe`, `flash-attn`, `reasoning`, `preserve_thinking`, and `ngram-mod` speculative decoding are doing real work.
  • This is a useful signal for local AI coding agents in VS Code, but it is still a single-user benchmark, not a general quality verdict on the models themselves.
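To make the tuning knobs above concrete, a 12GB launch along these lines could look roughly like the following. This is a hedged sketch, not the poster's actual command: `--n-cpu-moe`, `-ngl`, `-fa`, `-c`, and `--cache-type-k/v` are real llama.cpp `llama-server` flags (exact syntax varies by build), but the GGUF filename and every numeric value here are illustrative assumptions.

```shell
# Hypothetical llama-server launch for a 12GB GPU running an MoE quant.
# Filename and numbers are illustrative, not the poster's exact config.
#   -ngl 99               offload as many layers as fit onto the GPU
#   --n-cpu-moe 24        keep some MoE expert layers in system RAM to fit 12GB VRAM
#   -fa on                flash attention, trims KV-cache memory pressure
#   -c 16384              context size; the KV cache grows linearly with it
#   --cache-type-k/v q8_0 quantized KV cache to save further VRAM
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  -fa on \
  -c 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Speculative decoding with a separate draft model would add `--model-draft` and `--draft-max`/`--draft-min` on top of this; the `ngram-mod` variant the post mentions is build-specific and not shown here.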
// TAGS
qwen3.6-35b-a3b · gemma-4 · llama-cpp · unsloth · llm · ai-coding · gpu · benchmark

DISCOVERED

6h ago

2026-04-30

PUBLISHED

6h ago

2026-04-30

RELEVANCE

8/10

AUTHOR

mr_Owner