OPEN_SOURCE ↗
HN · HACKER_NEWS // 19d ago · INFRASTRUCTURE
ANEMLL demos 400B LLM on iPhone 17 Pro
ANEMLL showed a 400B-class sparse MoE model running on an iPhone 17 Pro at roughly 0.6 t/s. It is far too slow for casual chat, but it’s a striking proof-of-concept for Apple Neural Engine inference and aggressive local-model compression.
// ANALYSIS
This is more of a systems stunt than a product milestone, but the engineering is still real: ANEMLL is pushing on-device inference into territory that used to belong to desktop GPUs. The thread itself undercuts the headline a bit by noting the model is MoE and the team is rebuilding 4-bit weights, which makes the claim impressive without making it magical.
- 0.6 t/s is not usable for normal chat, so the demo proves feasibility, not comfort.
- The replies point to a giant KV cache and CPU/GPU RAM sharing, so memory management is doing as much work as raw model size.
- Calling it "400B" is only fair if you read that as total sparse parameters, not active parameters per token.
- For ANEMLL, the bigger win is validating its ANE/CoreML pipeline for extreme edge cases, which could matter more than this one stunt.
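The scale claims above are easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, using only the figures stated in the thread (400B total parameters, 4-bit weights, 0.6 t/s); the helper names and the 150-token reply length are illustrative assumptions, not ANEMLL measurements:

```python
# Back-of-envelope estimates for the numbers in the demo.
# Assumptions: 4-bit quantized weights, ignoring embeddings/activation overhead.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given parameter count."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def reply_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate a reply at a given decode rate."""
    return tokens / tokens_per_second

# A 400B-parameter model at 4 bits per weight:
print(f"{weight_memory_gb(400, 4):.0f} GB of weights")  # → 200 GB

# A modest 150-token chat reply at the demoed 0.6 t/s:
print(f"{reply_seconds(150, 0.6):.0f} s per reply")     # → 250 s (~4 min)
```

The 200 GB figure is why the thread's points about the KV cache and CPU/GPU RAM sharing matter: even at 4-bit, the full weights dwarf any phone's RAM, so sparse (MoE) routing and aggressive memory management are doing the heavy lifting.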
// TAGS
anemll · llm · inference · edge-ai · open-source
DISCOVERED
19d ago
2026-03-23
PUBLISHED
19d ago
2026-03-23
RELEVANCE
8/10
AUTHOR
anemll