OPEN_SOURCE
REDDIT // 4h ago // BENCHMARK RESULT
Qwen3.6-35B-A3B runs on 16GB Mac mini
A Reddit user reports loading and chatting with Qwen3.6-35B-A3B on a Mac mini M4 with 16GB RAM using an Unsloth GGUF quant and llama-server, claiming a bit over 6 tokens per second. It’s a useful proof point for how far sparse MoE models can be pushed on consumer hardware.
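The setup described can be sketched roughly as follows. This is a hedged reconstruction, not the poster's exact command: the model filename, context size, and port are assumptions, and `--n-gpu-layers 0` is the flag that disables GPU offload, matching the CPU-bound character of the reported run.

```shell
# Minimal CPU-only llama-server launch sketch (illustrative values).
# The GGUF filename below is an assumption, not the exact quant from the post.
llama-server \
  --model Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 0 \
  --ctx-size 8192 \
  --port 8080
```

With GPU offload disabled, throughput is bounded by CPU and memory bandwidth, which is consistent with the ~6 tok/sec figure being a floor rather than a ceiling for Apple Silicon.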
// ANALYSIS
The big story here is not raw speed but feasibility: a 35B-parameter open model with only ~3B parameters active per token can be made usable on a tiny Mac, which keeps local AI from feeling gated by workstation-class hardware.
- The posted setup uses an aggressive 4-bit quant, which is what makes the memory footprint plausible on 16GB systems
- The shared command appears to disable GPU offload, so the result is more of a CPU-bound local inference test than a best-case Apple Silicon benchmark
- Even at just over 6 tok/sec, this is enough for interactive use cases like private chat, agent loops, and lightweight coding help
- For developers, the takeaway is that “open-weight frontier-ish” models are increasingly a packaging and quantization problem, not just a server-room problem
- The result should be read as a practical anecdote, not a universal performance claim, because context size, cache settings, and backend choices will move the number a lot
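At interactive speeds like this, the local server can back simple tooling. A minimal client sketch against llama-server's OpenAI-compatible chat endpoint, using only the Python standard library; the host, port, and sampling parameters are assumptions to match the launch flags on your machine:

```python
import json
import urllib.request

# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint.
# Host and port are assumptions; match them to your own launch flags.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(prompt: str, max_tokens: int = 256) -> bytes:
    """Build the JSON body for a single-turn chat completion."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(body).encode("utf-8")

def chat(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        SERVER_URL,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

For agent loops, the same endpoint accepts multi-turn `messages` arrays, so state lives entirely in the client.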
// TAGS
qwen3.6-35b-a3b · llm · inference · self-hosted · cli · benchmark
DISCOVERED
4h ago
2026-04-19
PUBLISHED
8h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
DKO75