YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

llama.cpp ubatch tuning lifts prompt throughput 5x

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+ TRACKED FEEDS · 24/7 FEED SCRAPING

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

llama.cpp ubatch tuning lifts prompt throughput 5x
OPEN LINK ↗
// 2h ago · BENCHMARK RESULT

On a 24 GB RTX 3090, raising llama.cpp’s physical micro-batch size pushed prompt processing on gpt-oss-120b from roughly 380 tok/s at the default `-ub 512` to about 2,091 tok/s at `-ub 8192`. The tradeoff is a larger GPU compute buffer, so the faster setup needs a few more MoE layers offloaded to CPU with `--n-cpu-moe` to stay within the card's 24 GB.
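For orientation, here is a rough sketch of what the two configurations might look like as `llama-server` launches. The model filename, the `--n-cpu-moe` layer counts, and the context size are illustrative assumptions rather than values from the post; the point is that `-b` and `-ub` go up together while a slightly larger CPU offload absorbs the extra memory cost.

```
# Baseline-style launch: default physical micro-batch (-ub 512).
# (hypothetical model path, offload count, and context size)
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 20 -c 16384

# Tuned launch: raise the logical batch (-b) and physical micro-batch (-ub)
# together, and push a few more MoE layers to CPU to free GPU workspace.
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384 \
  -b 8192 -ub 8192
```

The same flags work with `llama-cli`; in most setups a larger `-ub` only pays off when `-b` is at least as large, which is why the two are usually raised in tandem.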

// ANALYSIS

This is a reminder that local LLM speed is often a memory-budget problem, not just a compute problem. For prompt-heavy workloads, ubatch tuning can matter more than buying a faster GPU.

  • Prefill throughput jumps from 240 tok/s at `-ub 256` to 2,091 tok/s at `-ub 8192`, which is the real story here.
  • Generation barely changes, slipping from about 32 tok/s to about 30 tok/s, so the trade is favorable if your workload is context-heavy.
  • The comparison is informal rather than perfectly controlled: the first four points are `pp4096`, while the 8192 point comes from `pp8192`.
  • The result is hardware-specific, but the pattern is general: free workspace by moving a little MoE work to CPU, then spend that headroom on a much larger micro-batch.
  • For partially offloaded MoE models, this is a tuning knob worth testing before assuming the hardware is the bottleneck; a quick benchmark sketch follows below.
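A minimal way to test the knob on your own machine is a `llama-bench` sweep over `-ub` at a fixed prompt length. The model path and the prompt/generation lengths below are placeholders; whatever MoE/CPU split you use for serving should be held constant across the sweep (depending on the build, `llama-bench` may or may not expose the same offload flag).

```
# Sweep the physical micro-batch size; keep -b at least as large as the
# largest -ub so the bigger values actually take effect.
llama-bench -m gpt-oss-120b.gguf -ngl 99 \
  -p 4096 -n 64 \
  -b 8192 -ub 256,512,1024,2048,4096,8192
```

`llama-bench` reports prompt-processing (`pp`) and generation (`tg`) rows separately, so the prefill gain and the small generation hit described above are both visible in a single run.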
// TAGS
llama-cpp · llm · benchmark · inference · gpu · moe · open-weights · quantization

DISCOVERED: 2h ago (2026-05-12)
PUBLISHED: 4h ago (2026-05-12)
RELEVANCE: 8 / 10
AUTHOR: coder543