OPEN_SOURCE ↗
REDDIT // 23d ago · NEWS
Apple's LLM in a Flash stress-tests Qwen3.5-397B locally
A Reddit discussion spotlights Dan Woods’ experiment combining Karpathy’s autoresearch workflow with Apple’s LLM in a Flash paper to run Qwen3.5-397B on an M3 MacBook Pro with 48GB of RAM at about 5.7 tokens per second. The result is less about making a 397B model “small” and more about showing that flash-aware loading, sparse activation, and iterative harness tuning can make very large models surprisingly usable on consumer hardware.
// ANALYSIS
Hot take: this feels like a meaningful infrastructure signal, not just a flashy benchmark stunt.
- The interesting part is the method stack: autonomous experiment loops plus memory-aware inference ideas turned into a practical local-run harness.
- The reported speed is impressive for a model this large, especially on 48GB unified memory, even if the MoE/sparse setup softens the headline a bit.
- The poster’s claim that the same hardware might reach roughly 18 tokens/sec suggests there is still a lot of headroom in the weight-loading and memory-access pattern.
- If this approach generalizes, SSD bandwidth and memory access strategy become first-class deployment constraints for local LLMs.
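To see why SSD bandwidth becomes a first-class constraint, here is a minimal back-of-the-envelope sketch. All numbers (active parameter count, quantization width, SSD bandwidth, cache hit rate) are illustrative assumptions, not figures from the post:

```python
# Back-of-the-envelope: decode speed when a flash-resident MoE model
# is limited by reading non-cached weights from SSD per token.
# Every input value below is a hypothetical assumption for illustration.

def tokens_per_sec(active_params_b: float, bytes_per_param: float,
                   ssd_gbps: float, hit_rate: float) -> float:
    """Tokens/sec if decode is bound by streaming weights from flash.

    active_params_b: parameters touched per token, in billions -- in an
        MoE, only the routed experts' weights, not all 397B.
    bytes_per_param: e.g. 0.5 for 4-bit quantized weights.
    ssd_gbps: sustained SSD read bandwidth in GB/s.
    hit_rate: fraction of needed weights already cached in unified memory.
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param * (1 - hit_rate)
    return ssd_gbps * 1e9 / bytes_per_token

# Example: ~20B active params, 4-bit weights, a 6 GB/s SSD, and 90% of
# the hot experts cached in RAM:
print(f"{tokens_per_sec(20, 0.5, 6.0, 0.9):.1f} tok/s")  # → 6.0 tok/s
```

Under these made-up inputs the model is strictly bandwidth-bound, which is why raising the cache hit rate or improving the access pattern (the headroom the poster points to) moves throughput almost linearly.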
// TAGS
llm-in-a-flash · qwen3.5 · autoresearch · local-llm · macbook-pro · apple-silicon · mixture-of-experts · inference
DISCOVERED
2026-03-19 (23d ago)
PUBLISHED
2026-03-19 (23d ago)
RELEVANCE
8/10
AUTHOR
pscoutou