OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.5 35B hits iPhone at 5.6 tok/sec
Alexintosh says he ported a Metal inference engine to iOS and got Qwen3.5 35B running fully on-device at 4-bit quantization, reaching 5.6 tok/sec. The demo streams MoE expert weights from SSD, and he hints at a bigger 379B run next.
// ANALYSIS
This is a strong proof-of-concept for phone-side MoE inference: the headline number is less important than the memory trick that makes it possible. If this holds up beyond a demo, local AI on mobile is moving from “tiny toy models” toward genuinely useful mid-sized models.
- 256-expert sparsity plus 4-bit quantization is exactly the kind of setup that can make a large MoE model mobile-feasible.
- SSD-to-GPU streaming is the real engineering win here; it shifts the bottleneck from fit-to-RAM to bandwidth and scheduling.
- 5.6 tok/sec is not desktop-fast, but it is fast enough to feel interactive on a phone.
- The 379B follow-up will be the more interesting stress test, because it will show whether the approach scales or just flatters the smaller model.
- This is more infrastructure than model hype: the model is familiar, but the deployment path is the news.
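Why bandwidth becomes the bottleneck can be seen with a back-of-envelope calculation. The sketch below estimates the SSD read rate needed to stream expert weights per token; the expert count, 4-bit quantization, and 5.6 tok/sec come from the post, while the layer count, active-expert count, and expert parameter share are illustrative assumptions, not reported figures.

```python
# Back-of-envelope: SSD bandwidth needed to stream MoE expert weights.
# Only NUM_EXPERTS, BITS_PER_WEIGHT, and TOKENS_PER_SEC come from the
# post; everything else is an illustrative assumption.

TOTAL_PARAMS = 35e9    # 35B total parameters
NUM_EXPERTS = 256      # expert count mentioned in the post
EXPERT_FRACTION = 0.9  # assumed share of params living in experts
ACTIVE_EXPERTS = 8     # assumed top-k experts routed per token
NUM_LAYERS = 48        # assumed number of MoE layers
BITS_PER_WEIGHT = 4    # 4-bit quantization from the post
TOKENS_PER_SEC = 5.6   # throughput reported in the post

# Size of one expert's weight slice in one layer, in bytes.
expert_params = TOTAL_PARAMS * EXPERT_FRACTION / (NUM_EXPERTS * NUM_LAYERS)
expert_bytes = expert_params * BITS_PER_WEIGHT / 8

# Worst case: every active expert in every layer is a cache miss
# and must be fetched from SSD for each generated token.
bytes_per_token = expert_bytes * ACTIVE_EXPERTS * NUM_LAYERS
required_bw_gb_s = bytes_per_token * TOKENS_PER_SEC / 1e9

print(f"per-expert slice: {expert_bytes / 1e6:.2f} MB")
print(f"worst-case per-token read: {bytes_per_token / 1e6:.1f} MB")
print(f"required SSD bandwidth: {required_bw_gb_s:.2f} GB/s")
```

Under these assumptions the worst case lands around a few GB/s, which is at the edge of what phone NVMe storage can sustain; any real implementation would lean on router locality across consecutive tokens and a RAM-resident expert cache to cut the miss rate well below this ceiling.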
// TAGS
qwen3.5 · llm · inference · edge-ai · gpu · open-source · benchmark
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
Alexintosh