ANEMLL demos 400B LLM on iPhone 17 Pro
HN · HACKER_NEWS // 19d ago // INFRASTRUCTURE


ANEMLL showed a 400B-class sparse MoE model running on an iPhone 17 Pro at roughly 0.6 t/s. It is far too slow for casual chat, but it’s a striking proof-of-concept for Apple Neural Engine inference and aggressive local-model compression.

// ANALYSIS

This is more of a systems stunt than a product milestone, but the engineering is real: ANEMLL is pushing on-device inference into territory that used to belong to desktop GPUs. The thread itself tempers the headline by noting the model is a sparse MoE and that the team is still rebuilding the 4-bit weights, which makes the claim impressive without making it magical.

  • 0.6 t/s is not usable for normal chat, so the demo proves feasibility, not comfort.
  • The replies point to a giant KV cache and CPU/GPU RAM sharing, so memory management is doing as much work as raw model size.
  • Calling it "400B" is only fair if you read that as total sparse parameters, not active parameters per token.
  • For ANEMLL, the bigger win is validating its ANE/CoreML pipeline for extreme edge cases, which could matter more than this one stunt.
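The bullets above boil down to two pieces of arithmetic. A minimal sketch (illustrative assumptions only; none of these figures beyond 0.6 t/s and "400B" come from the demo itself):

```python
# Back-of-envelope numbers behind the analysis above.

def reply_seconds(tokens: int, tps: float = 0.6) -> float:
    """Wall-clock time to generate a reply at a given tokens/sec rate."""
    return tokens / tps

def weight_gib(params: float, bits_per_weight: int = 4) -> float:
    """Approximate weight storage in GiB at a given quantization width."""
    return params * bits_per_weight / 8 / 2**30

# A modest 200-token chat reply at the demoed 0.6 t/s:
print(f"{reply_seconds(200) / 60:.1f} min")  # ~5.6 minutes per reply

# Why "400B" needs the sparse-vs-active caveat: 400e9 weights at 4-bit
# is far more than any phone's RAM, so only a fraction of the experts
# can be resident at once.
print(f"{weight_gib(400e9):.0f} GiB")  # ~186 GiB of 4-bit weights
```

The second number is why the thread's points about KV-cache size and CPU/GPU RAM sharing matter: at 4-bit, the full sparse parameter set dwarfs device memory, so the per-token working set has to be much smaller than the headline figure.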
// TAGS
anemll · llm · inference · edge-ai · open-source

DISCOVERED

19d ago

2026-03-23

PUBLISHED

19d ago

2026-03-23

RELEVANCE

8/10

AUTHOR

anemll