REDDIT · NEWS · 23d ago

Apple's LLM in Flash stress-tests Qwen3.5-397B locally

A Reddit discussion spotlights Dan Woods’ experiment combining Karpathy’s autoresearch workflow with Apple’s LLM in a Flash paper to run Qwen3.5-397B on an M3 MacBook Pro with 48GB of RAM at about 5.7 tokens per second. The result is less about making a 397B model “small” and more about showing that flash-aware loading, sparse activation, and iterative harness tuning can make very large models surprisingly usable on consumer hardware.
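The core trick from the LLM in a Flash paper is to keep the full weights on flash storage and pull into RAM only the rows a sparsity predictor expects to fire on the current token. A minimal sketch of that idea, with toy shapes and a hand-picked "predicted active" set standing in for the real predictor (all names and dimensions here are illustrative, not the actual Qwen3.5-397B layout):

```python
import numpy as np

# Toy dimensions for illustration only.
D_MODEL, D_FF = 1024, 4096

# Write a fake FFN up-projection matrix to disk ("flash"), then memory-map
# it so individual rows can be read on demand instead of loaded upfront.
rng = np.random.default_rng(0)
W_up = rng.standard_normal((D_FF, D_MODEL)).astype(np.float32)
W_up.tofile("ffn_up.bin")
W_flash = np.memmap("ffn_up.bin", dtype=np.float32, mode="r",
                    shape=(D_FF, D_MODEL))

def sparse_ffn_up(x, predicted_active):
    """Compute only the FFN rows a predictor expects to be nonzero.

    predicted_active: indices of neurons expected to survive the
    activation function (the paper trains a small predictor for this;
    here it is just a hard-coded toy list).
    """
    # Only these rows are transferred from flash into RAM.
    rows = np.asarray(W_flash[predicted_active])   # (k, D_MODEL) read
    return rows @ x                                # (k,) partial output

x = rng.standard_normal(D_MODEL).astype(np.float32)
active = np.array([3, 17, 256, 4000])              # toy predictor output
partial = sparse_ffn_up(x, active)
print(partial.shape)  # (4,)
```

The payoff is that per-token I/O scales with the number of *active* neurons, not the full layer width, which is what lets a 397B-parameter model fit its working set inside 48GB of unified memory.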

// ANALYSIS

Hot take: this feels like a meaningful infrastructure signal, not just a flashy benchmark stunt.

  • The interesting part is the method stack: autonomous experiment loops plus memory-aware inference ideas turned into a practical local-run harness.
  • The reported speed is impressive for a model this large, especially on 48GB unified memory, even if the MoE/sparse setup softens the headline a bit.
  • The poster’s claim that the same hardware might reach roughly 18 tokens/sec suggests there is still a lot of headroom in the weight-loading strategy and memory-access pattern.
  • If this approach generalizes, SSD bandwidth and memory access strategy become first-class deployment constraints for local LLMs.
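If SSD bandwidth really is the binding constraint, the throughput ceiling falls out of simple arithmetic: tokens/sec is bounded by bandwidth divided by bytes streamed per token. A back-of-envelope sketch (all numbers are illustrative assumptions, not measurements from the post):

```python
# Rough throughput bound when weights must be streamed from SSD per token.
# Every number below is an assumption for illustration.
ssd_bandwidth_bytes_per_sec = 6.0e9   # ~6 GB/s, plausible for an M3 MacBook Pro SSD
active_params_per_token = 2.0e9       # sparsely activated params per token (assumed)
bytes_per_param = 0.5                 # 4-bit quantization (assumed)

bytes_per_token = active_params_per_token * bytes_per_param
max_tokens_per_sec = ssd_bandwidth_bytes_per_sec / bytes_per_token
print(f"{max_tokens_per_sec:.1f} tokens/sec upper bound")  # 6.0
```

Under these made-up numbers the ceiling sits near the reported 5.7 tokens/sec, which is why shrinking the active set or caching hot experts in RAM, rather than a faster GPU, is the lever for reaching the claimed ~18 tokens/sec.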
// TAGS
llm-in-a-flash · qwen3.5 · autoresearch · local-llm · macbook-pro · apple-silicon · mixture-of-experts · inference

DISCOVERED

2026-03-19 (23d ago)

PUBLISHED

2026-03-19 (23d ago)

RELEVANCE

8/10

AUTHOR

pscoutou