Gemma 4 CPU Throughput Hits 15 t/s

// 95d ago · BENCHMARK RESULT

A LocalLLaMA thread asks how fast Gemma 4 runs on CPU, and the clearest numbers shared are about 8-11 tokens/sec on a Ryzen 5 3600 with a q4_k_m quant. The fastest reported figure in-thread is roughly 15 tokens/sec, but that setup used speculative decoding and was not CPU-only.

// ANALYSIS

The takeaway is that Gemma 4 looks usable on CPU, but the real speed unlock in this thread comes from speculative decoding, not just picking a smaller quant.
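A rough way to see why speculative decoding can matter more than the quant choice is the standard expected-acceptance estimate from the speculative decoding literature. The sketch below is illustrative only: the acceptance rate `alpha`, draft length `gamma`, and `draft_cost_ratio` are hypothetical inputs, not values reported in the thread, and the formula assumes each drafted token is accepted independently with the same probability.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    alpha: assumed probability that a drafted token is accepted (0..1).
    gamma: number of tokens the draft model proposes per step.
    """
    if alpha >= 1.0:
        # Every draft token accepted, plus the one token the target adds.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain decoding.

    draft_cost_ratio: assumed cost of one draft-model pass relative to
    one target-model pass (e.g. 0.1 if the draft is about 10x cheaper).
    """
    cost_per_step = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / cost_per_step


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the trade-off.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(estimated_speedup(alpha, gamma=4, draft_cost_ratio=0.1), 2))
```

Under these assumptions the gain depends heavily on how often the draft model guesses right, which is why a figure like 15 tok/s from a speculative setup is hard to compare with a plain CPU-only run.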

– The only concrete CPU-only result shared is 8-11 tok/s on a Ryzen 5 3600 with 64 GB DDR4, using the official q4_k_m quant in llama.cpp server.
– The top number, about 15 tok/s, came from a 128 GB Strix Halo system using speculative decoding with a draft model, so it is not a clean CPU-only baseline.
– The thread does not establish a clear “best” quant for both speed and quality, but q4_k_m looks like the practical starting point people are reaching for.
– For local Gemma 4 inference, memory bandwidth and decoding strategy appear to matter as much as raw CPU core count. On consumer hardware, memory bandwidth is usually the real bottleneck.
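The memory-bandwidth point can be made concrete with a back-of-the-envelope ceiling: during decoding, a dense model has to stream roughly its full set of weights from RAM for every generated token, so tokens/sec cannot exceed bandwidth divided by bytes read per token. The sketch below assumes dual-channel DDR4-3200 and a placeholder 5 GB of weights read per token; neither figure comes from the thread, and real throughput will land below this ceiling.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each channel moves `bus_bytes` (8 for a 64-bit DDR channel) per transfer.
    """
    return channels * bus_bytes * mega_transfers_per_s / 1000.0


def decode_ceiling_tok_s(bandwidth_gbs: float, weights_read_gb: float) -> float:
    """Upper bound on decode speed if every token streams `weights_read_gb` from RAM."""
    return bandwidth_gbs / weights_read_gb


if __name__ == "__main__":
    # Assumed memory config: 2 channels of DDR4-3200 (not stated in the thread).
    bw = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=3200)
    # Placeholder weight footprint per token; substitute the real quant file size.
    print(f"bandwidth ~{bw:.1f} GB/s, ceiling ~{decode_ceiling_tok_s(bw, 5.0):.1f} tok/s")
```

Adding CPU cores does not raise this ceiling, while a smaller quant or a decoding strategy that produces more tokens per weight pass does, which is consistent with the pattern in the reported numbers.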

// TAGS

gemma-4 · llm · benchmark · inference · open-source

DISCOVERED

95d ago

2026-04-08

PUBLISHED

95d ago

2026-04-08

RELEVANCE

9/10

AUTHOR

last_llm_standing
