Ollama NVFP4 slows with CPU offload

// 46d ago · BENCHMARK RESULT

The Reddit thread reports that throughput on Ollama's NVFP4 path drops sharply once the model no longer fits in VRAM and layers spill to the CPU. That fits the general rule for local inference: the speed win comes from staying GPU-resident, not from mixing in host-side execution.

// ANALYSIS

This looks less like a broken setup and more like a bandwidth wall. NVFP4 helps when the hot path stays on the GPU; once offload kicks in, transfer overhead and CPU execution erase much of the gain.
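A rough way to see the bandwidth wall is to treat token generation as memory-bound: each token requires reading the active weights once, and the time is split between GPU memory and host memory. The sketch below is a back-of-envelope model only. The function name, the bandwidth figures, and the 10 GB working set are hypothetical and are not taken from the Reddit thread; the model also ignores PCIe transfers, compute time, and KV-cache reads.

```python
def tokens_per_second(active_gb, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Estimate decode speed for a memory-bandwidth-bound model.

    active_gb    -- weight data read per generated token, in GB (assumed)
    gpu_fraction -- share of those weights resident in VRAM (0.0 to 1.0)
    gpu_bw_gbs   -- GPU memory bandwidth in GB/s (assumed)
    cpu_bw_gbs   -- host memory bandwidth in GB/s (assumed)
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    # Layers run one after another, so the per-token times add up.
    gpu_time = active_gb * gpu_fraction / gpu_bw_gbs
    cpu_time = active_gb * (1.0 - gpu_fraction) / cpu_bw_gbs
    return 1.0 / (gpu_time + cpu_time)


# Hypothetical hardware: 500 GB/s VRAM, 50 GB/s system RAM.
fully_resident = tokens_per_second(10, 1.0, 500, 50)  # about 50 tokens/s
partly_offloaded = tokens_per_second(10, 0.8, 500, 50)  # about 18 tokens/s
print(f"{fully_resident:.1f} tok/s on GPU, {partly_offloaded:.1f} tok/s with 20% offloaded")
```

Under these assumed numbers, moving only a fifth of the weights to host memory costs roughly two thirds of the throughput, because the slower memory dominates the per-token time.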

  • The 50 tok/s vs 14 tok/s gap is plausible if the model no longer fits cleanly in VRAM
  • CPU offload keeps the model runnable, but token generation becomes hybrid and much slower
  • The real bottleneck is often memory capacity, not raw GPU compute, especially for larger MoE models
  • If you want NVFP4 to pay off, you usually need enough VRAM to keep the full working set on-device
  • Otherwise, a smaller model or a different quantization that fits entirely in VRAM may beat NVFP4 with offload
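
The capacity point in the list above can be checked with simple arithmetic before downloading a model. This is a minimal sketch, not Ollama's own logic: the helper names are hypothetical, the 4.5 bits per weight assumes NVFP4's 4-bit values plus one 8-bit scale per 16-value block (real model files may keep some tensors at higher precision), and `overhead_gb` is a placeholder for KV cache and runtime buffers.

```python
def nvfp4_weight_gb(params_billions, bits_per_weight=4.5):
    """Approximate weight size in GB (1e9 bytes) for an NVFP4 model.

    bits_per_weight=4.5 is an assumption: 4-bit values plus an 8-bit
    scale shared by each 16-value block.
    """
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions, vram_gb, overhead_gb=2.0):
    """Return True if weights plus an assumed overhead fit in VRAM.

    overhead_gb is a hypothetical allowance for KV cache and buffers;
    it grows with context length in practice.
    """
    return nvfp4_weight_gb(params_billions) + overhead_gb <= vram_gb


# Example with a 24 GB card (hypothetical model sizes).
for size in (30, 70):
    verdict = "fits" if fits_in_vram(size, 24) else "needs offload"
    print(f"{size}B params: {nvfp4_weight_gb(size):.1f} GB weights, {verdict}")
```

If the check fails, the model will still run with CPU offload, but the estimate signals that a smaller model or a tighter quantization is likely the faster option.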
// TAGS
ollama · llm · inference · gpu · quantization · self-hosted · local-first

DISCOVERED

46d ago

2026-05-04

PUBLISHED

46d ago

2026-05-04

RELEVANCE

7/10

AUTHOR

6c5d1129