YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

Apple's LLM in a Flash stress-tests Qwen3.5-397B locally

// 69d ago · NEWS

A Reddit discussion spotlights Dan Woods’ experiment combining Karpathy’s autoresearch workflow with Apple’s LLM in a Flash paper to run Qwen3.5-397B on an M3 MacBook Pro with 48GB of RAM at about 5.7 tokens per second. The result is less about making a 397B model “small” and more about showing that flash-aware loading, sparse activation, and iterative harness tuning can make very large models surprisingly usable on consumer hardware.

// ANALYSIS

Hot take: this feels like a meaningful infrastructure signal, not just a flashy benchmark stunt.

  • The interesting part is the method stack: autonomous experiment loops plus memory-aware inference ideas turned into a practical local-run harness.
  • The reported speed is impressive for a model this large, especially on 48GB of unified memory, though the sparse MoE setup softens the headline: only a fraction of the 397B parameters is active for any given token.
  • The poster claims the same hardware might reach roughly 18 tokens/sec, which, if it holds, would mean weight loading and access patterns still leave considerable headroom.
  • If this approach generalizes, SSD bandwidth and memory access strategy become first-class deployment constraints for local LLMs.
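
To make the SSD-bandwidth point concrete, here is a rough back-of-envelope sketch. It is not taken from the experiment: the function name, parameters, and all numbers are hypothetical, and it assumes decoding is limited purely by how fast uncached active weights can be read from flash, ignoring compute and any overlap of I/O with compute.

```python
def flash_bound_tokens_per_sec(active_params, bytes_per_param,
                               cache_hit_rate, ssd_bytes_per_sec):
    """Upper bound on decode speed if flash reads are the only bottleneck.

    All arguments are illustrative assumptions, not measured values:
      active_params     - parameters touched per token (sparse/MoE subset)
      bytes_per_param   - storage per parameter after quantization
      cache_hit_rate    - fraction of those weights already resident in RAM
      ssd_bytes_per_sec - sustained SSD read throughput
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Bytes that must be fetched from flash for each generated token.
    bytes_from_flash = active_params * bytes_per_param * (1.0 - cache_hit_rate)
    if bytes_from_flash == 0:
        return float("inf")  # everything is cached; flash is not the limit
    return ssd_bytes_per_sec / bytes_from_flash


# Hypothetical inputs: 10B active parameters at 4 bits each,
# half of them cached in RAM, and a 5 GB/s SSD.
bound = flash_bound_tokens_per_sec(10e9, 0.5, 0.5, 5e9)
print(f"flash-bound ceiling: {bound:.1f} tokens/sec")
```

Under this simplified model the ceiling moves with three levers: fewer active parameters per token, fewer bytes per parameter, and a higher RAM hit rate. That is one way to read the claim that access strategy matters as much as raw bandwidth.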
// TAGS
llm-in-a-flash, qwen3.5, autoresearch, local-llm, macbook-pro, apple-silicon, mixture-of-experts, inference

DISCOVERED

69d ago

2026-03-19

PUBLISHED

69d ago

2026-03-19

RELEVANCE

8/10

AUTHOR

pscoutou