Gemma 4 hits 81 tok/sec on M5 Max

// BENCHMARK RESULT (102d ago)

Google's Gemma 4 26B (A4B) reaches 81 tokens per second on Apple's M5 Max, using a Mixture-of-Experts (MoE) design to deliver fast generation at a peak power draw of 114W.

// ANALYSIS

Google's A4B architecture activates roughly 4 billion of its 26 billion total parameters per token. Because token generation is typically limited by memory bandwidth rather than compute, reading only the active parameters lets the M5 Max's 614 GB/s of bandwidth deliver inference speeds formerly reserved for 7B-class models. At 81 tokens per second, latency is low enough for complex, multi-step agentic tool-calling without frustrating wait times. The 114W peak power draw is efficient for this level of throughput, but thermal throttling remains a consideration for extended generation sessions. Apple's unified memory architecture continues to be a major advantage: all 26B weights can be loaded without hitting the VRAM capacity limits typical of consumer Nvidia mobile GPUs.
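The bandwidth argument above can be checked with a back-of-envelope calculation. The sketch below is a rough model, not a description of how the benchmark was run: it assumes decode speed is capped by how fast the active weights can be read from memory, and it ignores KV-cache traffic, expert routing, and compute. The bytes-per-parameter values are assumptions, since the source does not state which quantization was used; the function names are illustrative.

```python
# Back-of-envelope estimates from the figures quoted in the article:
# 614 GB/s bandwidth, ~4B active / 26B total parameters, 81 tok/s, 114 W peak.
# The bytes-per-parameter values below are ASSUMPTIONS (quantization is not
# stated in the source).

def decode_ceiling_tok_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper bound on tokens/sec if every token must read all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

def weights_footprint_gb(total_params_b, bytes_per_param):
    """Memory needed to hold all weights (active or not), in GB."""
    return total_params_b * bytes_per_param

def joules_per_token(power_w, tok_per_sec):
    """Energy per generated token, treating peak power as if it were sustained."""
    return power_w / tok_per_sec

if __name__ == "__main__":
    for label, bytes_per_param in [("4-bit", 0.5), ("8-bit", 1.0), ("16-bit", 2.0)]:
        ceiling = decode_ceiling_tok_per_sec(614, 4, bytes_per_param)
        footprint = weights_footprint_gb(26, bytes_per_param)
        print(f"{label}: ceiling ~{ceiling:.1f} tok/s, weights ~{footprint:.0f} GB")
    print(f"energy: ~{joules_per_token(114, 81):.2f} J/token at peak power")
```

Comparing the reported 81 tok/s against these ceilings gives a rough sense of how much headroom the simple model leaves. The footprint column also shows why the full 26B weights must fit in memory even though only about 4B are read per token.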

// TAGS

gemma-4, llm, apple-silicon, m5-max, inference, open-weights, moe

DISCOVERED

102d ago

2026-04-03

PUBLISHED

102d ago

2026-04-03

RELEVANCE

8/10

AUTHOR

Bderken
