OPEN_SOURCE ↗
REDDIT // 3h ago · INFRASTRUCTURE
Gemma 4 26B-A4B Trails 7B GPTQ
This post asks why Gemma 4 26B A4B feels slower on vLLM than a previous Qwen 2.5 VL 7B GPTQ int4 setup, despite the model activating only about 4B parameters per token. The core issue is that sparse activation does not automatically translate to lower end-to-end latency: MoE routing, expert dispatch, multimodal plumbing, and framework/kernel support all affect speed.
// ANALYSIS
Hot take: “4B active” is not the same as “4B fast.” Inference latency is dominated by the whole serving stack, not just active parameter count.
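A back-of-envelope decode estimate makes the point. Decode is typically memory-bandwidth bound, so the rough ceiling is bandwidth divided by bytes read per token. The numbers below are illustrative assumptions (bf16 MoE weights, int4 dense weights, a hypothetical 1 TB/s GPU), not measurements of either model:

```python
# Illustrative decode ceiling: tokens/s ~= HBM bandwidth / weight-bytes read per token.
# Ignores KV cache, routing overhead, and kernel efficiency; all numbers are assumptions.

GB = 1e9
bandwidth = 1000 * GB          # hypothetical 1 TB/s memory bandwidth

dense_bytes = 7e9 * 0.5        # dense 7B at int4: ~0.5 bytes/param per token
moe_bytes = 4e9 * 2.0          # MoE with ~4B active params at bf16: 2 bytes/param

dense_tps = bandwidth / dense_bytes
moe_tps = bandwidth / moe_bytes

print(round(dense_tps), round(moe_tps))  # → 286 125
```

Under these (hedged) assumptions the dense int4 7B reads fewer bytes per token than the "4B active" MoE in bf16, so it can decode faster despite having more active parameters.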
- Gemma 4 26B A4B is a sparse MoE model; routing tokens through experts adds overhead that a dense 7B GPTQ model does not have.
- vLLM's MoE path depends on expert-parallel and optimized kernels; if the deployment is not tuned for MoE, throughput and latency can suffer.
- GPTQ int4 on a 7B model is extremely bandwidth-efficient, so the smaller dense model can win on decode speed even if its raw quality is lower.
- Gemma 4 is natively multimodal and built for long-context workloads, which can add serving complexity even when you are using text-only prompts.
- If the model or parts of it are spilling off GPU, or if batch size/context length is high, the MoE advantage can disappear quickly.
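The routing overhead in the first bullet can be sketched concretely. A minimal toy MoE layer (tiny illustrative shapes, not Gemma's actual architecture) shows the extra work a sparse layer does per token versus a dense layer's single matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d, n_experts, top_k = 8, 16, 4, 2
x = rng.standard_normal((tokens, d))

# Dense baseline: one matmul, no routing.
w_dense = rng.standard_normal((d, d))
y_dense = x @ w_dense

# MoE path: router scores -> top-k selection -> gather/dispatch -> weighted combine.
w_router = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

logits = x @ w_router
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
top = np.argsort(-probs, axis=-1)[:, :top_k]      # chosen expert ids per token

y_moe = np.zeros_like(x)
for e in range(n_experts):
    mask = (top == e).any(axis=-1)                # which tokens route to expert e
    if mask.any():
        gate = probs[mask, e:e + 1]
        y_moe[mask] += gate * (x[mask] @ experts[e])  # dispatch, compute, combine
```

Even in this toy form, the sparse path adds a router matmul, a sort, and per-expert gather/scatter that the dense path skips entirely; in a real serving stack those steps become kernel launches and memory traffic, which is where untuned MoE deployments lose time.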
// TAGS
gemma · moe · vllm · inference · latency · throughput · quantization · multimodal
DISCOVERED
3h ago
2026-04-17
PUBLISHED
18h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
everyoneisodd