YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

llama.cpp ubatch tuning lifts prompt throughput 5x

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+ TRACKED FEEDS · 24/7 FEED SCRAPING

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

llama.cpp ubatch tuning lifts prompt throughput 5x
OPEN LINK ↗
// 2h ago · BENCHMARK RESULT

On a 24 GB RTX 3090, raising llama.cpp’s physical micro-batch size pushed prompt processing on gpt-oss-120b from roughly 380 tok/s at the default `-ub 512` to about 2,091 tok/s at `-ub 8192`. The tradeoff is a larger GPU compute buffer, so the faster setup needs a few more MoE layers offloaded to CPU with `--n-cpu-moe` to stay within the card's 24 GB.
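For orientation, here is a rough sketch of what the two configurations might look like as `llama-server` launches. The model filename, the `--n-cpu-moe` layer counts, and the context size are illustrative assumptions rather than values from the post; the point is that `-b` and `-ub` go up together while a slightly larger CPU offload absorbs the extra memory cost.

```
# Baseline-style launch: default physical micro-batch (-ub 512).
# (hypothetical model path, offload count, and context size)
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 20 -c 16384

# Tuned launch: raise the logical batch (-b) and physical micro-batch (-ub)
# together, and push a few more MoE layers to CPU to free GPU workspace.
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384 \
  -b 8192 -ub 8192
```

The same flags work with `llama-cli`; in most setups a larger `-ub` only pays off when `-b` is at least as large, which is why the two are usually raised in tandem.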

// ANALYSIS

This is a reminder that local LLM speed is often a memory-budget problem, not just a compute problem. For prompt-heavy workloads, ubatch tuning can matter more than buying a faster GPU.

  • Prefill throughput jumps from 240 tok/s at `-ub 256` to 2,091 tok/s at `-ub 8192`, which is the real story here.
  • Generation barely changes, slipping from about 32 tok/s to about 30 tok/s, so the trade is favorable if your workload is context-heavy.
  • The comparison is informal rather than perfectly controlled: the first four points are `pp4096`, while the 8192 point comes from `pp8192`.
  • The result is hardware-specific, but the pattern is general: free workspace by moving a little MoE work to CPU, then spend that headroom on a much larger micro-batch.
  • For partially offloaded MoE models, this is a tuning knob worth testing before assuming the hardware is the bottleneck; a quick benchmark sketch follows below.
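A minimal way to test the knob on your own machine is a `llama-bench` sweep over `-ub` at a fixed prompt length. The model path and the prompt/generation lengths below are placeholders; whatever MoE/CPU split you use for serving should be held constant across the sweep (depending on the build, `llama-bench` may or may not expose the same offload flag).

```
# Sweep the physical micro-batch size; keep -b at least as large as the
# largest -ub so the bigger values actually take effect.
llama-bench -m gpt-oss-120b.gguf -ngl 99 \
  -p 4096 -n 64 \
  -b 8192 -ub 256,512,1024,2048,4096,8192
```

`llama-bench` reports prompt-processing (`pp`) and generation (`tg`) rows separately, so the prefill gain and the small generation hit described above are both visible in a single run.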
// TAGS
llama-cpp · llm · benchmark · inference · gpu · moe · open-weights · quantization

DISCOVERED: 2h ago (2026-05-12)
PUBLISHED: 4h ago (2026-05-12)
RELEVANCE: 8 / 10
AUTHOR: coder543