APEX quantization boosts Gemma 4 MoE performance

// 46d ago · MODEL RELEASE

Developer mudler released APEX (Adaptive Precision for EXpert Models), a quantization format optimized for Mixture-of-Experts (MoE) models such as Google's Gemma 4. It is reported to reach 38 tokens per second at a 90,000-token context while avoiding the repetition loops that standard quants often fall into at long context.

// ANALYSIS

APEX quants mark a notable shift for sparse architectures: they suggest that a uniform bit-width is wasteful for Mixture-of-Experts models. By keeping critical "edge" layers and shared experts at higher precision while aggressively compressing the more redundant routed experts, the method lets a 26B model run with a speed and memory footprint closer to those of much smaller models, with little reported loss in output quality.
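The allocation idea described above can be sketched as a per-tensor bit-width policy. This is a minimal illustration only: the `Tensor` record, the `pick_bits` function, the tensor kinds, the number of edge layers, and the 8/6/4-bit choices are all assumptions made for the example, not APEX's actual rules or values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    """Hypothetical description of one weight tensor in an MoE model."""
    name: str
    layer: int
    kind: str      # e.g. "shared_expert", "routed_expert", "attention"
    n_params: int


def pick_bits(t: Tensor, n_layers: int, edge: int = 2) -> int:
    """Choose a bit-width for a tensor (illustrative values, not APEX's)."""
    # "Edge" layers: the first and last `edge` layers of the stack.
    is_edge = t.layer < edge or t.layer >= n_layers - edge
    if is_edge or t.kind == "shared_expert":
        return 8   # protect tensors assumed to be sensitive
    if t.kind == "routed_expert":
        return 4   # compress the bulk of the parameters hardest
    return 6       # everything else gets a middle setting


def average_bits(tensors: list[Tensor], n_layers: int, edge: int = 2) -> float:
    """Parameter-weighted average bit-width over all tensors."""
    total = sum(t.n_params for t in tensors)
    weighted = sum(pick_bits(t, n_layers, edge) * t.n_params for t in tensors)
    return weighted / total
```

Because routed experts hold most of an MoE model's parameters, the weighted average ends up close to the low setting even though the protected tensors stay at higher precision.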

– MoE Efficiency: Fits 26B parameters into ~15GB of VRAM by compressing the sparsely activated routed experts hardest, which suits 16GB consumer cards.
– Context Stability: Reported to stay more stable beyond 50k tokens of context than standard UD-Q5 quants, which often fall into repetition loops.
– Performance Sweet Spot: Delivers inference at 38 tokens per second, fast enough to make a 26B local model practical for real-time applications.
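As a rough check on the memory figure, the average bits per weight implied by ~15GB for 26B parameters can be worked out directly. The sketch below assumes 1GB = 10^9 bytes and that the whole 15GB is weights; in practice part of VRAM goes to the KV cache and runtime overhead, so the real average would be lower. The function names are made up for this example.

```python
def avg_bits_per_weight(size_gb: float, n_params_billion: float) -> float:
    """Average bits per parameter if `size_gb` holds only the weights."""
    return size_gb * 8 / n_params_billion


def size_gb(n_params_billion: float, bits: float) -> float:
    """Weight size in GB (10^9 bytes) at a uniform bit-width."""
    return n_params_billion * bits / 8


print(round(avg_bits_per_weight(15, 26), 2))  # 4.62
print(size_gb(26, 16))                        # 52.0
```

Under these assumptions the model averages about 4.6 bits per weight, against 52GB for the same parameters at a uniform 16 bits.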

// TAGS

gemma-4, llm, quantization, apex, localai, moe, apex-quantization

DISCOVERED

46d ago

2026-05-23

PUBLISHED

46d ago

2026-05-23

RELEVANCE

8/10

AUTHOR

Any-Chipmunk5480
