Qwen3.6-35B-A3B runs on 16GB Mac mini
OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT


A Reddit user reports loading and chatting with Qwen3.6-35B-A3B on a Mac mini M4 with 16GB RAM, using an Unsloth GGUF quant served through llama-server, and claims a bit over 6 tokens per second. It’s a useful proof point for how far sparse mixture-of-experts (MoE) models can be pushed on consumer hardware.

// ANALYSIS

The big story here is not raw speed but feasibility: a 35B-parameter open model with only ~3B parameters active per token can be made usable on a tiny Mac, which keeps local AI from feeling gated by workstation-class hardware.

  • The posted setup uses an aggressive 4-bit quant, which is what makes the memory footprint plausible on 16GB systems
  • The shared command appears to disable GPU offload, so the result is more of a CPU-bound local inference test than a best-case Apple Silicon benchmark
  • Even at just over 6 tok/sec, this is enough for interactive use cases like private chat, agent loops, and lightweight coding help
  • For developers, the takeaway is that “open-weight frontier-ish” models are increasingly a packaging and quantization problem, not just a server-room problem
  • The result should be read as a practical anecdote, not a universal performance claim, because context size, cache settings, and backend choices will move the number a lot
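The post’s exact command isn’t reproduced here, but a representative llama-server invocation matching the setup described above might look like this. The model filename is a hypothetical Unsloth-style quant name, not the actual file from the post:

```shell
# Illustrative sketch, not the poster's exact command.
#   -m      : path to the GGUF quant (filename here is an assumption)
#   -ngl 0  : disable GPU offload; all layers run on CPU, matching the
#             CPU-bound result described in the analysis
#   -c 4096 : modest context window to keep the KV cache small on 16GB
#   --port  : expose llama-server's OpenAI-compatible HTTP API locally
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 0 -c 4096 --port 8080
```

Once running, the server can be exercised from any OpenAI-compatible client pointed at `http://localhost:8080/v1`, which is how throughput numbers like the ~6 tok/sec figure are typically measured.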
// TAGS
qwen3.6-35b-a3b · llm · inference · self-hosted · cli · benchmark

DISCOVERED

4h ago

2026-04-19

PUBLISHED

8h ago

2026-04-19

RELEVANCE

8 / 10

AUTHOR

DKO75