Ollama NVFP4 slows with CPU offload
The Reddit thread says throughput on Ollama's NVFP4 path drops sharply once the model no longer fits in VRAM and layers spill to the CPU. That fits the general rule for local inference: the speed win comes from staying GPU-resident, not from mixing in host-side execution.
This looks less like a broken setup and more like a bandwidth wall. NVFP4 helps when the hot path stays on the GPU; once offload kicks in, transfer overhead and CPU execution erase much of the gain.
- The 50 tok/s vs 14 tok/s gap is plausible if the model no longer fits cleanly in VRAM
- CPU offload keeps the model runnable, but token generation becomes hybrid and much slower
- The real bottleneck is often memory capacity, not raw GPU compute, especially for larger MoE models
- If you want NVFP4 to pay off, you usually need enough VRAM to keep the full working set on-device
- Otherwise, a smaller model or a different quantization may beat a fancy format with offload
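
A back-of-envelope sketch of the bandwidth argument above, assuming decoding is purely memory-bandwidth-bound: every weight needed for a token is read once, GPU and CPU layers run one after the other, and transfer and compute overheads are ignored. The function name and all numbers (20 GB read per token, 1000 GB/s GPU memory, 80 GB/s system memory) are hypothetical values chosen for illustration, not measurements from the thread.

```python
def decode_tokens_per_second(active_gb, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Rough upper bound on decode speed for a partially offloaded model.

    active_gb    -- GB of weights read per token (for MoE, only the active experts)
    gpu_fraction -- share of those weights resident in VRAM, between 0.0 and 1.0
    gpu_bw_gbs   -- GPU memory bandwidth in GB/s (assumed figure)
    cpu_bw_gbs   -- system memory bandwidth in GB/s (assumed figure)
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0.0 and 1.0")
    # Time to stream each part of the weights once, in seconds per token.
    gpu_seconds = active_gb * gpu_fraction / gpu_bw_gbs
    cpu_seconds = active_gb * (1.0 - gpu_fraction) / cpu_bw_gbs
    # Layers execute sequentially, so the per-token times add up.
    return 1.0 / (gpu_seconds + cpu_seconds)


# Hypothetical hardware and model size, for illustration only.
fully_resident = decode_tokens_per_second(20.0, 1.00, 1000.0, 80.0)
quarter_offloaded = decode_tokens_per_second(20.0, 0.75, 1000.0, 80.0)
print(f"100% on GPU: {fully_resident:.1f} tok/s")
print(f" 75% on GPU: {quarter_offloaded:.1f} tok/s")
```

With these made-up inputs the estimate falls from about 50 tok/s fully resident to about 13 tok/s with a quarter of the weights on the CPU. The slow memory dominates the per-token time even when it holds the smaller share, which is why a modest spill can produce a gap of the size reported.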
DISCOVERED: 2026-05-04
PUBLISHED: 2026-05-04
AUTHOR: 6c5d1129