OPEN_SOURCE ↗
REDDIT // 23d ago · NEWS
Apple's LLM in a Flash stress-tests Qwen3.5-397B locally
A Reddit discussion spotlights Dan Woods’ experiment combining Karpathy’s autoresearch workflow with Apple’s LLM in a Flash paper to run Qwen3.5-397B on an M3 MacBook Pro with 48GB of RAM at about 5.7 tokens per second. The result is less about making a 397B model “small” and more about showing that flash-aware loading, sparse activation, and iterative harness tuning can make very large models surprisingly usable on consumer hardware.
// ANALYSIS
Hot take: this feels like a meaningful infrastructure signal, not just a flashy benchmark stunt.
- The interesting part is the method stack: autonomous experiment loops plus memory-aware inference ideas turned into a practical local-run harness.
- The reported speed is impressive for a model this large, especially on 48GB unified memory, even if the MoE/sparse setup softens the headline a bit.
- The poster’s claim that the same hardware might reach roughly 18 tokens/sec suggests there is still a lot of headroom in the weight-loading and memory-access pattern.
- If this approach generalizes, SSD bandwidth and memory access strategy become first-class deployment constraints for local LLMs.
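To see why SSD bandwidth becomes a first-class constraint, here is a minimal back-of-the-envelope sketch. All numbers (active parameter count, quantization width, SSD bandwidth, cache hit rate) are illustrative assumptions, not figures from the post:

```python
# Back-of-the-envelope: decode speed when a flash-resident MoE model
# is limited by reading non-cached weights from SSD per token.
# Every input value below is a hypothetical assumption for illustration.

def tokens_per_sec(active_params_b: float, bytes_per_param: float,
                   ssd_gbps: float, hit_rate: float) -> float:
    """Tokens/sec if decode is bound by streaming weights from flash.

    active_params_b: parameters touched per token, in billions -- in an
        MoE, only the routed experts' weights, not all 397B.
    bytes_per_param: e.g. 0.5 for 4-bit quantized weights.
    ssd_gbps: sustained SSD read bandwidth in GB/s.
    hit_rate: fraction of needed weights already cached in unified memory.
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param * (1 - hit_rate)
    return ssd_gbps * 1e9 / bytes_per_token

# Example: ~20B active params, 4-bit weights, a 6 GB/s SSD, and 90% of
# the hot experts cached in RAM:
print(f"{tokens_per_sec(20, 0.5, 6.0, 0.9):.1f} tok/s")  # → 6.0 tok/s
```

Under these made-up inputs the model is strictly bandwidth-bound, which is why raising the cache hit rate or improving the access pattern (the headroom the poster points to) moves throughput almost linearly.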
// TAGS
llm-in-a-flash · qwen3.5 · autoresearch · local-llm · macbook-pro · apple-silicon · mixture-of-experts · inference
DISCOVERED
2026-03-19 (23d ago)
PUBLISHED
2026-03-19 (23d ago)
RELEVANCE
8/10
AUTHOR
pscoutou