llama.cpp ubatch tuning lifts prompt throughput 5x
On a 24 GB RTX 3090, raising llama.cpp’s physical micro-batch size pushed prompt processing on gpt-oss-120b from roughly 380 tok/s at the default `-ub 512` to about 2,091 tok/s at `-ub 8192`. The tradeoff is more GPU workspace, so the faster setup needs a few more MoE layers offloaded to CPU with `--n-cpu-moe`.
This is a reminder that local LLM speed is often a memory-budget problem, not just a compute problem. For prompt-heavy workloads, ubatch tuning can matter more than buying a faster GPU.
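The tuned setup boils down to two flags pulling against each other: a larger `-ub` buys prefill throughput but needs more VRAM for work buffers, while `--n-cpu-moe` frees that VRAM by parking some expert weights on the CPU. A minimal sketch of such a launch, assuming a recent llama.cpp build; the GGUF filename and the offload count are placeholders, not values from the post:

```sh
# Sketch of the tuned launch, assuming a recent llama.cpp build.
# The GGUF filename and the --n-cpu-moe count below are placeholders;
# pick the smallest --n-cpu-moe that leaves room for the larger -ub buffers.
# Note: the logical batch (-b) must be at least as large as -ub,
# otherwise the physical micro-batch is clamped back down.
./llama-server \
  -m ./gpt-oss-120b-mxfp4.gguf \
  -ngl 99 \
  --n-cpu-moe 8 \
  -ub 8192 -b 8192 \
  -c 16384
```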
- Prefill throughput jumps from 240 tok/s at `-ub 256` to 2,091 tok/s at `-ub 8192`, which is the real story here.
- Generation barely changes, slipping from about 32 tok/s to about 30 tok/s, so the trade is favorable if your workload is context-heavy.
- The comparison is informal rather than perfectly controlled: the first four points are `pp4096`, while the 8192 point comes from `pp8192`.
- The result is hardware-specific, but the pattern is general: free workspace by moving a little MoE work to CPU, then spend that headroom on a much larger micro-batch.
- For partially offloaded MoE models, this is a tuning knob worth testing before assuming the hardware is the bottleneck (a sweep sketch follows this list).
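Since the right balance is hardware-specific, the quickest test is a micro-batch sweep. A sketch using llama-bench, assuming a recent build whose llama-bench accepts `--n-cpu-moe`; the model path, offload count, and sweep values are illustrative, not from the post:

```sh
# Sweep physical micro-batch sizes with a fixed CPU-MoE offload count.
# Assumes a recent llama.cpp build where llama-bench supports --n-cpu-moe;
# the model path and offload count are placeholders for your own setup.
./llama-bench \
  -m ./gpt-oss-120b-mxfp4.gguf \
  --n-cpu-moe 8 \
  -ub 512,1024,2048,4096,8192 \
  -b 8192 \
  -p 4096 \
  -n 64
```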
DISCOVERED: 2026-05-12
PUBLISHED: 2026-05-12
AUTHOR: coder543