OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE
Ollama NVFP4 slows with CPU offload
The Reddit thread reports that Ollama's NVFP4 throughput drops sharply once the model no longer fits in VRAM and layers spill to the CPU. That fits the general rule for local inference: the speed win comes from staying GPU-resident, not from mixing in host-side execution.
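A quick way to check whether a loaded model is actually GPU-resident is Ollama's `ps` subcommand, which reports how the model is split between CPU and GPU. A minimal sketch, assuming a local Ollama install; the output format in the comment is approximate and not taken from the thread.

```python
# Minimal sketch: check whether the currently loaded Ollama model is GPU-resident.
# Assumes `ollama` is on PATH and the server is running locally.
import subprocess

def gpu_residency() -> str:
    # `ollama ps` lists loaded models with a PROCESSOR column such as
    # "100% GPU" or "42%/58% CPU/GPU" when layers have spilled to the CPU.
    result = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    print(gpu_residency())
```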
// ANALYSIS
This looks less like a broken setup and more like a bandwidth wall. NVFP4 helps when the hot path stays on the GPU; once offload kicks in, transfer overhead and CPU execution erase much of the gain.
- The 50 tok/s vs 14 tok/s gap is plausible if the model no longer fits cleanly in VRAM (a measurement sketch follows after this list)
- CPU offload keeps the model runnable, but token generation becomes hybrid and much slower
- The real bottleneck is often memory capacity, not raw GPU compute, especially for larger MoE models
- If you want NVFP4 to pay off, you usually need enough VRAM to keep the full working set on-device
- Otherwise, a smaller model or a different quantization may beat a fancy format with offload
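To see the offload cliff for yourself, benchmark the same prompt at different GPU layer counts and compute tokens per second from the generation stats Ollama returns. The sketch below is a rough example against Ollama's local REST API under assumptions: the model tag, prompt, and layer counts are placeholders, and `num_gpu` behavior can vary by model and Ollama version.

```python
# Minimal sketch: compare generation throughput at different GPU layer caps
# to approximate the offload cliff described above. Assumes a local Ollama
# server on the default port; the model tag is a placeholder.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, num_gpu_layers: int) -> float:
    payload = {
        "model": model,
        "prompt": "Explain NVFP4 quantization in one paragraph.",
        "stream": False,
        # num_gpu caps how many layers Ollama places on the GPU; lowering it
        # forces more layers onto the CPU, mimicking a spill-to-host scenario.
        "options": {"num_gpu": num_gpu_layers},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_count is the number of generated tokens; eval_duration is nanoseconds.
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

if __name__ == "__main__":
    # A value larger than the model's layer count keeps everything on the GPU
    # (VRAM permitting); a low cap deliberately pushes layers onto the CPU.
    for layers in (999, 20):
        print(layers, f"{tokens_per_second('llama3.1:8b', layers):.1f} tok/s")
```

Running `ollama ps` alongside each call confirms how much of the model actually landed on the GPU for a given cap.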
// TAGS
ollama · llm · inference · gpu · quantization · self-hosted · local-first
DISCOVERED
4h ago
2026-05-04
PUBLISHED
5h ago
2026-05-04
RELEVANCE
7/10
AUTHOR
6c5d1129