Gemma 4 CPU Throughput Hits 15 t/s

// 49d ago · BENCHMARK RESULT

A LocalLLaMA thread asks how fast Gemma 4 runs on CPU, and the clearest numbers shared are about 8-11 tokens/sec on a Ryzen 5 3600 with a q4_k_m quant. The fastest reported figure in-thread is roughly 15 tokens/sec, but that setup used speculative decoding and was not CPU-only.

// ANALYSIS

The takeaway is that Gemma 4 looks usable on CPU, but the real speed unlock in this thread comes from speculative decoding, not just picking a smaller quant.
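Speculative decoding, the technique behind the thread's fastest number, can be sketched in a few lines. This is a minimal greedy version under stated assumptions, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the large model and the small draft model, a real engine scores all draft positions in one batched forward pass (emulated here with a loop), and sampling-based acceptance is left out. What it shows is the core property: the output is identical to decoding with the target alone, but the target runs fewer times when the draft guesses well.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to its greedy next token.
NextTokenFn = Callable[[List[int]], int]


def speculative_decode(target: NextTokenFn, draft: NextTokenFn, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target verification passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target verifies the proposal. A real engine scores all k positions
        #    in ONE batched forward pass; this loop only emulates that single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                # First mismatch: keep the target's own token and discard the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the same pass also yields one extra target token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    # The last round may overshoot max_new; trim to the requested length.
    return seq[: len(prompt) + max_new], target_passes


# Hypothetical toy models: the target counts up mod 10; the draft agrees except after a 5.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, prompt=[0], max_new=12)
    print(tokens, passes)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2] 4
```

Here the target runs 4 verification passes instead of 12 for the same 12 tokens. In practice the gain depends on how often the draft's guesses are accepted and on how cheap the draft is relative to the target.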

  • The only concrete CPU-only result shared is 8-11 tok/s on a Ryzen 5 3600 with 64 GB DDR4, using the official q4_k_m quant with llama.cpp's server.
  • The top number, about 15 tok/s, came from a 128 GB Strix Halo system using speculative decoding with a draft model, so it is not a clean CPU-only baseline.
  • The thread does not establish a clear “best” quant for both speed and quality, but q4_k_m looks like the practical starting point people are reaching for.
  • For local Gemma 4 inference, memory bandwidth (usually the real bottleneck on consumer hardware) and decoding strategy appear to matter as much as raw CPU core count.
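The bandwidth point can be made concrete with a back-of-envelope ceiling: during decoding, each generated token requires streaming the active weights from RAM roughly once, so tokens/s cannot exceed bandwidth divided by bytes read per token. The sketch below rests on loudly stated assumptions, none taken from the thread: dual-channel DDR4-3200 (the RAM speed was not given), about 4.85 bits per weight as a rough q4_k_m average, and hypothetical active-parameter counts that are not Gemma 4's published sizes.

```python
def peak_dram_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (each DDR4 channel has a 64-bit, i.e. 8-byte, bus)."""
    return channels * bus_bytes * mega_transfers_per_s / 1000  # MB/s -> GB/s


def decode_ceiling_tps(bandwidth_gbs: float, active_params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s if every active weight is read from RAM once per token.

    Ignores KV-cache reads, compute limits, and CPU caches, so real throughput is lower.
    """
    gb_per_token = active_params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 -> GB
    return bandwidth_gbs / gb_per_token


if __name__ == "__main__":
    # ASSUMPTION: dual-channel DDR4-3200; the thread does not state the RAM speed.
    bw = peak_dram_bandwidth_gbs(channels=2, mega_transfers_per_s=3200)
    # ASSUMPTION: ~4.85 bits/weight as a rough q4_k_m average; the parameter
    # counts below are hypothetical examples, not Gemma 4's actual sizes.
    for active_b in (4, 8, 12):
        ceiling = decode_ceiling_tps(bw, active_b, 4.85)
        print(f"{active_b}B active params: <= {ceiling:.1f} tok/s at {bw:.1f} GB/s")
```

Because the ceiling is bandwidth divided by bytes per token, extra cores help little once memory is saturated. A smaller quant shrinks the bytes per token, and speculative decoding lets one pass over the weights verify several tokens, which is why both show up as the practical levers in the thread.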
// TAGS
gemma-4 · llm · benchmark · inference · open-source

DISCOVERED

49d ago

2026-04-08

PUBLISHED

49d ago

2026-04-08

RELEVANCE

9/10

AUTHOR

last_llm_standing