Qwen3.5 35B hits iPhone, 5.6 tok/sec

// 67d ago · BENCHMARK RESULT

Alexintosh says he ported a Metal inference engine to iOS and got Qwen3.5 35B running fully on-device with 4-bit quantization at 5.6 tok/sec. The demo streams MoE expert weights from SSD and hints at a bigger 379B run next.

// ANALYSIS

This is a strong proof-of-concept for phone-side MoE inference: the headline number is less important than the memory trick that makes it possible. If this holds up beyond a demo, local AI on mobile is moving from “tiny toy models” toward genuinely useful mid-sized models.

  • 256-expert sparsity plus 4-bit quantization is exactly the kind of setup that can make a large MoE model mobile-feasible.
  • SSD-to-GPU streaming is the real engineering win here; it shifts the bottleneck from fitting the whole model in RAM to storage bandwidth and scheduling.
  • 5.6 tok/sec is not desktop-fast, but it is fast enough to feel interactive on a phone.
  • The 379B follow-up will be the more interesting stress test, because it will show whether the approach scales or just flatters the smaller model.
  • This is more infrastructure than model hype: the model is familiar, but the deployment path is the news.
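
The memory argument in the bullets above can be made concrete with some back-of-the-envelope arithmetic. The Python sketch below is illustrative only: the 35B parameter count and 4-bit width come from the post, while the active-parameter count, cache hit rate, and SSD bandwidth are hypothetical placeholders, not measurements from the demo. It also simplifies by treating every active weight as streamable, whereas a real engine would likely keep attention and shared weights resident.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage needed for n_params weights at a given bit width."""
    return n_params * bits_per_weight / 8


def streaming_bound_tok_per_sec(
    active_params: float,
    bits_per_weight: float,
    cache_hit_rate: float,
    ssd_bytes_per_sec: float,
) -> float:
    """Upper bound on decode speed if SSD reads were the only cost.

    Only the weights touched for one token that miss the in-RAM cache
    have to be read from storage.
    """
    bytes_touched = weight_bytes(active_params, bits_per_weight)
    bytes_from_ssd = bytes_touched * (1.0 - cache_hit_rate)
    if bytes_from_ssd <= 0:
        return float("inf")  # everything served from RAM, storage is not the limit
    return ssd_bytes_per_sec / bytes_from_ssd


GB = 1e9

# From the post: 35B total parameters, 4-bit weights.
total_gb = weight_bytes(35e9, 4) / GB  # 17.5 GB, before any quantization metadata

# ASSUMPTIONS (hypothetical placeholders, not from the post):
ACTIVE_PARAMS = 3e9         # parameters touched per token
CACHE_HIT_RATE = 0.8        # fraction of touched weights already in RAM
SSD_BYTES_PER_SEC = 2 * GB  # sustained read bandwidth

bound = streaming_bound_tok_per_sec(
    ACTIVE_PARAMS, 4, CACHE_HIT_RATE, SSD_BYTES_PER_SEC
)

print(f"full 4-bit weights: {total_gb:.1f} GB")
print(f"streaming-bound decode rate: {bound:.2f} tok/sec")
```

The first figure shows why the full model cannot simply sit in RAM on a phone; the second shows why, once experts are streamed, throughput depends on how many bytes per token miss the cache and how fast storage can supply them. Changing the placeholder values is a quick way to see how sensitive the bound is to each assumption.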
// TAGS
qwen3.5, llm, inference, edge-ai, gpu, open-source, benchmark

DISCOVERED

67d ago

2026-03-22

PUBLISHED

67d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

Alexintosh