Gemma 4 31B stalls on MacBook M5 Max

// 45d ago · BENCHMARK RESULT

Google's Gemma 4 31B model exhibits a 42-second initial latency on Apple M5 Max hardware due to a Flash Attention implementation bug. The bottleneck highlights a critical software-hardware mismatch in the latest hybrid attention architectures.

// ANALYSIS

The 42-second delay is a wake-up call for local inference: specialized hardware like the M5 Max requires tighter framework integration to handle complex hybrid models.

– The root cause is a Flash Attention bug that fails to handle Gemma 4's dual-dimension (256/512) head architecture across its hybrid layers.
– Disabling Flash Attention (`FA=0`) bypasses the stall but caps generation at ~21 tokens per second, underutilizing the M5 Max's memory bandwidth.
– Native MLX runners show 4x better performance than standard GGUF implementations, indicating the "slowdown" is a software stack issue, not a hardware limitation.
– Massive KV cache requirements (up to 22GB) mean 128GB of unified memory is becoming the new floor for "research-grade" local LLMs.
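
The KV cache and throughput points above can be sanity-checked with back-of-envelope arithmetic. The sketch below is a rough estimate only: the layer count, KV head count, context length, bandwidth, and weight size are hypothetical placeholders, not Gemma 4's published architecture or measured M5 Max figures. It also assumes every layer caches the full context, which overestimates any layer that uses a sliding window.

```python
def kv_cache_bytes(head_dims, num_kv_heads, context_tokens, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    head_dims: one head dimension per layer, so mixed 256/512 layers
    can be summed directly. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return sum(
        2 * num_kv_heads * dim * context_tokens * bytes_per_value
        for dim in head_dims
    )


def decode_ceiling_tokens_per_s(bandwidth_gb_s, weight_gb):
    """Bandwidth-bound ceiling: each generated token streams all weights once."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Hypothetical mix of layers (assumption, for illustration only).
    layers = [256] * 40 + [512] * 8
    cache = kv_cache_bytes(layers, num_kv_heads=8, context_tokens=32_768)
    print(f"KV cache: {cache / 2**30:.1f} GiB")

    # Hypothetical bandwidth (GB/s) and quantized weight size (GB).
    print(f"Decode ceiling: {decode_ceiling_tokens_per_s(400, 20):.1f} tok/s")
```

Substituting a model's real configuration and a machine's real bandwidth shows how far a measured figure such as ~21 tokens per second sits below the bandwidth-bound ceiling, and how quickly long contexts push the cache into tens of gigabytes.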

// TAGS

gemma-4, google, llm, local-first, inference, apple-silicon, m5-max, quantization

DISCOVERED

45d ago

2026-05-28

PUBLISHED

45d ago

2026-05-28

RELEVANCE

8/10

AUTHOR

bridgemindai
