Gemma 4 31B stalls on MacBook M5 Max
Google's Gemma 4 31B model exhibits a 42-second initial latency on Apple M5 Max hardware due to a Flash Attention implementation bug. The bottleneck highlights a critical software-hardware mismatch when running the latest hybrid attention architectures.
The 42-second delay is a wake-up call for local inference: specialized hardware like the M5 Max requires tighter framework integration to handle complex hybrid models.
- The root cause is a Flash Attention bug that fails to handle Gemma 4's dual-dimension (256/512) head architecture across its hybrid layers.
- Disabling Flash Attention (`FA=0`) bypasses the stall but caps generation at ~21 tokens per second, underutilizing the M5 Max's bandwidth.
- Native MLX runners show 4x better performance than standard GGUF implementations, indicating that the "slowdown" is a software stack issue, not a hardware limitation.
- Massive KV cache requirements (up to 22GB) mean 128GB of unified memory is becoming the new floor for "research-grade" local LLMs.
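To see why the KV cache can reach tens of gigabytes, a rough back-of-the-envelope estimator may help. This is a minimal sketch, not Gemma 4's actual configuration: the `LayerGroup` type, the layer counts, the KV head counts, the sliding-window size, and the context length are all hypothetical values chosen for illustration. Only the 256/512 head dimensions come from the report above. It assumes each layer stores one key and one value vector per KV head per cached token, and that sliding-window layers cache at most their window.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LayerGroup:
    """A set of attention layers sharing one shape (hypothetical helper type)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    # None means the layers cache the full context; an integer models a
    # sliding-window layer that keeps at most that many tokens.
    max_cached_tokens: Optional[int] = None


def kv_cache_bytes(groups: Sequence[LayerGroup], context_len: int,
                   bytes_per_element: int = 2) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    total = 0
    for g in groups:
        cached = context_len
        if g.max_cached_tokens is not None:
            cached = min(context_len, g.max_cached_tokens)
        total += (2 * g.num_layers * g.num_kv_heads * g.head_dim
                  * cached * bytes_per_element)
    return total


if __name__ == "__main__":
    # Illustrative numbers only; NOT the real Gemma 4 31B layout.
    hypothetical = [
        LayerGroup(num_layers=12, num_kv_heads=4, head_dim=512),  # full-context layers
        LayerGroup(num_layers=48, num_kv_heads=8, head_dim=256,
                   max_cached_tokens=1024),                       # sliding-window layers
    ]
    size = kv_cache_bytes(hypothetical, context_len=131072, bytes_per_element=2)
    print(f"Estimated KV cache: {size / 2**30:.2f} GiB")
```

Under these assumptions the full-context layers dominate the total, because their cache grows linearly with context length while the sliding-window layers stay at a fixed size. That is why long-context runs are the ones that push memory requirements up.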
DISCOVERED: 2026-05-28
PUBLISHED: 2026-05-28
AUTHOR: bridgemindai