OPEN_SOURCE
REDDIT // 14h ago · INFRASTRUCTURE
M1 Max MacBooks throttle under LLMs
A LocalLLaMA user says a 64GB M1 Max MacBook Pro starts around 50 tokens/sec but falls to single digits within minutes while running Qwen 3.5 35B A3B. The post asks whether Tahoe, Sequoia, or the machine itself is the real bottleneck for sustained local-LLM inference.
// ANALYSIS
This reads less like a dead-end machine and more like sustained-load physics: 35B-class local inference can push Apple silicon into thermal and power limits fast, and Tahoe-era background work may be adding drag. Users who want stable throughput on a Mac need to think about model size, quantization, cooling, and OS activity together.
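The model-size point can be made concrete with back-of-envelope arithmetic: decode on Apple silicon is usually memory-bandwidth bound, so the ceiling on tokens/sec is roughly bandwidth divided by the bytes of weights read per token. The numbers below are illustrative assumptions, not from the post: ~400 GB/s peak unified-memory bandwidth for an M1 Max, ~0.5 bytes/parameter at a 4-bit quant, and an MoE "A3B" model streaming only its ~3B active parameters per token.

```python
def decode_ceiling_tps(active_params_b: float, bytes_per_param: float,
                       bandwidth_gbs: float = 400.0) -> float:
    """Rough upper bound on tokens/sec if every decoded token must
    stream the active weights once from memory (ignores compute limits,
    KV-cache traffic, and any cache reuse)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Dense 35B at 4-bit: all weights read per token.
dense_35b = decode_ceiling_tps(35, 0.5)
# MoE with ~3B active parameters at 4-bit: far less traffic per token.
moe_a3b = decode_ceiling_tps(3, 0.5)
```

Under these assumptions a dense 35B model tops out in the low tens of tokens/sec while the MoE configuration's ceiling is an order of magnitude higher, which is consistent with a fast start that thermal limits then erode.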
- A 64GB M1 Max is capable, but 35B models are still heavy enough to expose thermal headroom and memory-bandwidth ceilings over time.
- Reports around macOS Tahoe point to higher temps, constant fan use, and background processes like WindowServer or Spotlight, while some users say Sequoia feels cooler.
- For local LLMs, smaller quantized models usually give better sustained tokens/sec than chasing a large model that benchmarks well at first and then throttles.
- If the slowdown happens within minutes, it is worth checking fan behavior, ambient temperature, display scaling, login items, and indexing before blaming the chip outright.
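To tell a genuine thermal throttle from a one-off warm-up cost, it helps to log a rolling tokens/sec rather than a single end-of-run average. A minimal sketch, independent of any particular inference library (the `tick` call would go inside your own decode loop; the class name and API are illustrative, not from the post):

```python
import time
from collections import deque
from typing import Deque, Optional

class ThroughputMonitor:
    """Tracks rolling tokens/sec so a gradual throttle shows up as a
    steadily declining rate, not just a lower final average."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self.stamps: Deque[float] = deque()

    def tick(self, now: Optional[float] = None) -> float:
        """Record one generated token; return tokens/sec over the window."""
        now = time.monotonic() if now is None else now
        self.stamps.append(now)
        # Drop timestamps that have aged out of the trailing window.
        while self.stamps and now - self.stamps[0] > self.window_s:
            self.stamps.popleft()
        return len(self.stamps) / self.window_s
```

Calling `mon.tick()` once per decoded token and printing the rate every few seconds makes the pattern in the post visible: a steady fall from ~50 tokens/sec to single digits over minutes points at thermal or power limits rather than model load time.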
// TAGS
macbook-pro · llm · inference · gpu · edge-ai
DISCOVERED
2026-04-17 (14h ago)
PUBLISHED
2026-04-17 (15h ago)
RELEVANCE
7 / 10
AUTHOR
Ayumu_Kasuga