OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE
Ollama NVFP4 slows with CPU offload
The Reddit thread reports that Ollama's NVFP4 throughput drops sharply once the model no longer fits in VRAM and layers spill to the CPU. That fits the general rule for local inference: the speed win comes from staying GPU-resident, not from mixing in host-side execution.
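A quick way to check whether a loaded model is actually GPU-resident is Ollama's `ps` subcommand, which reports how the model is split between CPU and GPU. A minimal sketch, assuming a local Ollama install; the output format in the comment is approximate and not taken from the thread.

```python
# Minimal sketch: check whether the currently loaded Ollama model is GPU-resident.
# Assumes `ollama` is on PATH and the server is running locally.
import subprocess

def gpu_residency() -> str:
    # `ollama ps` lists loaded models with a PROCESSOR column such as
    # "100% GPU" or "42%/58% CPU/GPU" when layers have spilled to the CPU.
    result = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    print(gpu_residency())
```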
// ANALYSIS
This looks less like a broken setup and more like a bandwidth wall. NVFP4 helps when the hot path stays on the GPU; once offload kicks in, transfer overhead and CPU execution erase much of the gain.
- The 50 tok/s vs 14 tok/s gap is plausible if the model no longer fits cleanly in VRAM (a measurement sketch follows after this list)
- CPU offload keeps the model runnable, but token generation becomes hybrid and much slower
- The real bottleneck is often memory capacity, not raw GPU compute, especially for larger MoE models
- If you want NVFP4 to pay off, you usually need enough VRAM to keep the full working set on-device
- Otherwise, a smaller model or a different quantization may beat a fancy format with offload
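To see the offload cliff for yourself, benchmark the same prompt at different GPU layer counts and compute tokens per second from the generation stats Ollama returns. The sketch below is a rough example against Ollama's local REST API under assumptions: the model tag, prompt, and layer counts are placeholders, and `num_gpu` behavior can vary by model and Ollama version.

```python
# Minimal sketch: compare generation throughput at different GPU layer caps
# to approximate the offload cliff described above. Assumes a local Ollama
# server on the default port; the model tag is a placeholder.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, num_gpu_layers: int) -> float:
    payload = {
        "model": model,
        "prompt": "Explain NVFP4 quantization in one paragraph.",
        "stream": False,
        # num_gpu caps how many layers Ollama places on the GPU; lowering it
        # forces more layers onto the CPU, mimicking a spill-to-host scenario.
        "options": {"num_gpu": num_gpu_layers},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_count is the number of generated tokens; eval_duration is nanoseconds.
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

if __name__ == "__main__":
    # A value larger than the model's layer count keeps everything on the GPU
    # (VRAM permitting); a low cap deliberately pushes layers onto the CPU.
    for layers in (999, 20):
        print(layers, f"{tokens_per_second('llama3.1:8b', layers):.1f} tok/s")
```

Running `ollama ps` alongside each call confirms how much of the model actually landed on the GPU for a given cap.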
// TAGS
ollama · llm · inference · gpu · quantization · self-hosted · local-first
DISCOVERED
4h ago
2026-05-04
PUBLISHED
5h ago
2026-05-04
RELEVANCE
7/10
AUTHOR
6c5d1129