OPEN_SOURCE
REDDIT // 4h ago // BENCHMARK RESULT
Qwen3.6-35B-A3B runs on 16GB Mac mini
A Reddit user reports loading and chatting with Qwen3.6-35B-A3B on a Mac mini M4 with 16GB RAM using an Unsloth GGUF quant and llama-server, claiming a bit over 6 tokens per second. It’s a useful proof point for how far sparse MoE models can be pushed on consumer hardware.
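The setup described can be sketched roughly as follows. This is a hedged reconstruction, not the poster's exact command: the model filename, context size, and port are assumptions, and `--n-gpu-layers 0` is the flag that disables GPU offload, matching the CPU-bound character of the reported run.

```shell
# Minimal CPU-only llama-server launch sketch (illustrative values).
# The GGUF filename below is an assumption, not the exact quant from the post.
llama-server \
  --model Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 0 \
  --ctx-size 8192 \
  --port 8080
```

With GPU offload disabled, throughput is bounded by CPU and memory bandwidth, which is consistent with the ~6 tok/sec figure being a floor rather than a ceiling for Apple Silicon.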
// ANALYSIS
The big story here is not raw speed but feasibility: a 35B-parameter open model with only ~3B parameters active per token can be made usable on a tiny Mac, which keeps local AI from feeling gated by workstation-class hardware.
- The posted setup uses an aggressive 4-bit quant, which is what makes the memory footprint plausible on 16GB systems
- The shared command appears to disable GPU offload, so the result is more of a CPU-bound local inference test than a best-case Apple Silicon benchmark
- Even at just over 6 tok/sec, this is enough for interactive use cases like private chat, agent loops, and lightweight coding help
- For developers, the takeaway is that “open-weight frontier-ish” models are increasingly a packaging and quantization problem, not just a server-room problem
- The result should be read as a practical anecdote, not a universal performance claim, because context size, cache settings, and backend choices will move the number a lot
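At interactive speeds like this, the local server can back simple tooling. A minimal client sketch against llama-server's OpenAI-compatible chat endpoint, using only the Python standard library; the host, port, and sampling parameters are assumptions to match the launch flags on your machine:

```python
import json
import urllib.request

# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint.
# Host and port are assumptions; match them to your own launch flags.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(prompt: str, max_tokens: int = 256) -> bytes:
    """Build the JSON body for a single-turn chat completion."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(body).encode("utf-8")

def chat(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        SERVER_URL,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

For agent loops, the same endpoint accepts multi-turn `messages` arrays, so state lives entirely in the client.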
// TAGS
qwen3.6-35b-a3b · llm · inference · self-hosted · cli · benchmark
DISCOVERED
4h ago
2026-04-19
PUBLISHED
8h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
DKO75