OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.5 35B hits iPhone at 5.6 tok/sec
Alexintosh says he ported a Metal inference engine to iOS and got Qwen3.5 35B running fully on-device at 4-bit quantization, reaching 5.6 tok/sec. The demo streams MoE expert weights from SSD, and he hints at a bigger 379B run next.
// ANALYSIS
This is a strong proof-of-concept for phone-side MoE inference: the headline number is less important than the memory trick that makes it possible. If this holds up beyond a demo, local AI on mobile is moving from “tiny toy models” toward genuinely useful mid-sized models.
- 256-expert sparsity plus 4-bit quantization is exactly the kind of setup that can make a large MoE model mobile-feasible.
- SSD-to-GPU streaming is the real engineering win here; it shifts the bottleneck from fit-to-RAM to bandwidth and scheduling.
- 5.6 tok/sec is not desktop-fast, but it is fast enough to feel interactive on a phone.
- The 379B follow-up will be the more interesting stress test, because it will show whether the approach scales or just flatters the smaller model.
- This is more infrastructure than model hype: the model is familiar, but the deployment path is the news.
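Why bandwidth becomes the bottleneck can be seen with a back-of-envelope calculation. The sketch below estimates the SSD read rate needed to stream expert weights per token; the expert count, 4-bit quantization, and 5.6 tok/sec come from the post, while the layer count, active-expert count, and expert parameter share are illustrative assumptions, not reported figures.

```python
# Back-of-envelope: SSD bandwidth needed to stream MoE expert weights.
# Only NUM_EXPERTS, BITS_PER_WEIGHT, and TOKENS_PER_SEC come from the
# post; everything else is an illustrative assumption.

TOTAL_PARAMS = 35e9    # 35B total parameters
NUM_EXPERTS = 256      # expert count mentioned in the post
EXPERT_FRACTION = 0.9  # assumed share of params living in experts
ACTIVE_EXPERTS = 8     # assumed top-k experts routed per token
NUM_LAYERS = 48        # assumed number of MoE layers
BITS_PER_WEIGHT = 4    # 4-bit quantization from the post
TOKENS_PER_SEC = 5.6   # throughput reported in the post

# Size of one expert's weight slice in one layer, in bytes.
expert_params = TOTAL_PARAMS * EXPERT_FRACTION / (NUM_EXPERTS * NUM_LAYERS)
expert_bytes = expert_params * BITS_PER_WEIGHT / 8

# Worst case: every active expert in every layer is a cache miss
# and must be fetched from SSD for each generated token.
bytes_per_token = expert_bytes * ACTIVE_EXPERTS * NUM_LAYERS
required_bw_gb_s = bytes_per_token * TOKENS_PER_SEC / 1e9

print(f"per-expert slice: {expert_bytes / 1e6:.2f} MB")
print(f"worst-case per-token read: {bytes_per_token / 1e6:.1f} MB")
print(f"required SSD bandwidth: {required_bw_gb_s:.2f} GB/s")
```

Under these assumptions the worst case lands around a few GB/s, which is at the edge of what phone NVMe storage can sustain; any real implementation would lean on router locality across consecutive tokens and a RAM-resident expert cache to cut the miss rate well below this ceiling.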
// TAGS
qwen3.5 · llm · inference · edge-ai · gpu · open-source · benchmark
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
Alexintosh