A4B hits 24 tok/s for Gemma 4 26B on RTX 4060

// 90d ago · OPEN-SOURCE RELEASE

A new inference strategy for Mixture-of-Experts (MoE) models such as Gemma 4 26B makes local deployment practical on consumer GPUs with limited VRAM. The A4B project keeps the attention layers on the GPU and offloads inactive experts to system RAM. Because only a few experts are active for any given token, it reports 24 tok/s on an RTX 4060.
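The offloading idea can be illustrated with a small sketch. This is not the A4B project's code: the `ExpertCache` class, the `top_k_experts` helper, and the LRU eviction policy are all assumptions chosen for illustration, and string placeholders stand in for real expert weights. It only models the bookkeeping, where all experts live in host RAM and a fixed number of "GPU slots" hold the ones the router selected most recently.

```python
from collections import OrderedDict


class ExpertCache:
    """Hypothetical LRU cache modelling a fixed number of GPU-resident expert slots."""

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots
        self.resident = OrderedDict()  # expert_id -> weights currently "on the GPU"
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id, host_weights):
        """Return the expert's weights, copying them in from host RAM on a miss."""
        if expert_id in self.resident:
            # Already resident: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
        else:
            self.misses += 1
            if len(self.resident) >= self.gpu_slots:
                # Evict the least recently used expert to free a slot.
                self.resident.popitem(last=False)
            # Stand-in for a host-to-device copy.
            self.resident[expert_id] = host_weights[expert_id]
        return self.resident[expert_id]


def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token (assumed router output)."""
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return order[:k]


# Usage: 8 experts in host RAM, 2 GPU slots, 2 active experts per token.
host_weights = {i: f"expert-{i}-weights" for i in range(8)}
cache = ExpertCache(gpu_slots=2)
for expert_id in top_k_experts([0.1, 0.7, 0.05, 0.9, 0.0, 0.2, 0.3, 0.1], k=2):
    cache.fetch(expert_id, host_weights)
```

In a sketch like this, throughput depends on the hit rate: every miss stands for a transfer from system RAM to the GPU, so the more often consecutive tokens reuse the same experts, the less data has to move.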

// ANALYSIS

This technique turns an apparent MoE liability, a very large total weight count, into a deployment advantage for local users: most of those weights are idle on any given token and do not need to sit in VRAM.

– Exploits MoE sparsity by treating system RAM as dynamic swap space for inactive experts, significantly outperforming traditional CPU-only inference.
– Maintains high throughput (24 tok/s) on entry-level mobile GPUs, making 20B+ parameter models viable for everyday use.
– Highlights a shift in local LLM optimization: the transfer bandwidth between system RAM and the GPU becomes the new bottleneck, which may favor MoE over dense models for home servers.
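The bandwidth point in the last bullet can be made concrete with a back-of-envelope bound. The function below is a hypothetical helper, and every input in the usage example is a placeholder rather than a measurement of Gemma 4, A4B, or the RTX 4060. It assumes that the only cost per token is copying the missed expert weights over the RAM-to-GPU link.

```python
def transfer_bound_tok_per_s(active_params, bytes_per_param, miss_rate, link_gb_per_s):
    """Upper bound on tokens/s if RAM-to-GPU transfers are the only cost.

    active_params   -- expert parameters touched per token (assumed)
    bytes_per_param -- storage per parameter, e.g. 0.5 for 4-bit weights (assumed)
    miss_rate       -- fraction of those weights not already on the GPU (assumed)
    link_gb_per_s   -- sustained host-to-device bandwidth in GB/s (assumed)
    """
    bytes_per_token = active_params * bytes_per_param * miss_rate
    if bytes_per_token == 0:
        # Nothing needs to be copied, so the link imposes no limit.
        return float("inf")
    return link_gb_per_s * 1e9 / bytes_per_token


# Placeholder inputs: 4e9 active parameters, 4-bit weights,
# a quarter of them missing from VRAM, and a 16 GB/s link.
bound = transfer_bound_tok_per_s(4e9, 0.5, 0.25, 16)
print(f"{bound:.1f} tok/s upper bound")  # prints "32.0 tok/s upper bound"
```

The bound scales with the link bandwidth and inversely with the miss rate, which is why a dense model of the same total size, where every weight is needed for every token, would fare far worse under the same offloading scheme.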

// TAGS

a4b, gemma-4, moe, rtx-4060, llm-offloading, local-inference, inference-optimization, github

DISCOVERED

90d ago

2026-04-15

PUBLISHED

90d ago

2026-04-14

RELEVANCE

8/10

AUTHOR

Initial_Mousse_8713
