Qwen3.5 35B hits iPhone, 5.6 tok/sec
OPEN_SOURCE
REDDIT · 21d ago · BENCHMARK RESULT


Alexintosh says he ported a Metal inference engine to iOS and got Qwen3.5 35B running fully on-device at 4-bit quantization, reaching 5.6 tok/sec. The demo streams MoE experts from SSD rather than holding the full model in RAM, and he hints at attempting a 379B model next.

// ANALYSIS

This is a strong proof-of-concept for phone-side MoE inference: the headline number is less important than the memory trick that makes it possible. If this holds up beyond a demo, local AI on mobile is moving from “tiny toy models” toward genuinely useful mid-sized models.

  • 256-expert sparsity plus 4-bit quantization is exactly the kind of setup that can make a large MoE model mobile-feasible.
  • SSD-to-GPU streaming is the real engineering win here; it shifts the bottleneck from fit-to-RAM to bandwidth and scheduling.
  • 5.6 tok/sec is not desktop-fast, but it is fast enough to feel interactive on a phone.
  • The 379B follow-up will be the more interesting stress test, because it will show whether the approach scales or just flatters the smaller model.
  • This is more infrastructure than model hype: the model is familiar, but the deployment path is the news.
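To see why the SSD-streaming trick matters more than the headline number, a rough feasibility sketch helps. The figures below are illustrative assumptions, not measurements from the demo: the active-parameter fraction and SSD bandwidth are guesses chosen only to show the shape of the math.

```python
# Back-of-envelope bound for SSD-streamed MoE inference.
# All constants are illustrative assumptions, NOT values from the demo.

TOTAL_PARAMS = 35e9      # Qwen3.5 35B total parameters
BITS_PER_WEIGHT = 4      # 4-bit quantization
ACTIVE_FRACTION = 0.1    # assumed fraction of weights touched per token (MoE sparsity)
SSD_BANDWIDTH = 3e9      # assumed 3 GB/s sustained SSD read on a recent iPhone

total_bytes = TOTAL_PARAMS * BITS_PER_WEIGHT / 8   # full model footprint on disk
active_bytes = total_bytes * ACTIVE_FRACTION       # weight bytes needed per token

# If every active expert were re-read from SSD on every token,
# storage bandwidth alone would cap throughput at:
tok_per_sec_bound = SSD_BANDWIDTH / active_bytes

print(f"model on disk:      {total_bytes / 1e9:.1f} GB")   # 17.5 GB
print(f"weights per token:  {active_bytes / 1e9:.2f} GB")  # 1.75 GB
print(f"naive upper bound:  {tok_per_sec_bound:.1f} tok/sec")
```

Under these assumptions the naive bound lands below the reported 5.6 tok/sec, which is exactly the point: hitting that number would require caching hot experts in RAM and overlapping reads with compute, i.e. the scheduling work is where the engineering win lives.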
// TAGS
qwen3.5 · llm · inference · edge-ai · gpu · open-source · benchmark

DISCOVERED

21d ago

2026-03-22

PUBLISHED

21d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

Alexintosh