llama.cpp OpenVINO backend crawls on Intel NPU
OPEN_SOURCE
REDDIT // 19d ago · BENCHMARK RESULT

One r/LocalLLaMA user says the `ghcr.io/ggml-org/llama.cpp:server-openvino` Docker image only manages about 0.1 tokens/sec on a Core Ultra 2 255U NPU with Qwen3-4B Q4_0, far below their earlier OpenVINO-native container experiments. They’re asking whether anyone else has seen better throughput on the same backend.
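One quick way to cross-check a figure like 0.1 tokens/sec is to read it straight out of the `timings` block that `llama-server` attaches to its `/completion` responses, rather than eyeballing the stream. A minimal sketch, assuming the usual `predicted_n`/`predicted_ms` field names; the sample values are invented to mirror the reported throughput:

```python
import json

# Hypothetical excerpt of a llama-server /completion response;
# the numbers are made up to match the ~0.1 tok/s report.
sample = json.loads("""
{
  "content": "...",
  "timings": {"predicted_n": 12, "predicted_ms": 120000.0}
}
""")

def predicted_tps(resp: dict) -> float:
    """Decode throughput in tokens/sec from the server's timing block."""
    t = resp["timings"]
    return t["predicted_n"] / (t["predicted_ms"] / 1000.0)

print(predicted_tps(sample))  # → 0.1
```

Reading the server-reported timings also separates prompt-processing time from generation time, which matters on an NPU where prefill and decode can behave very differently.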

// ANALYSIS

Hot take: this looks less like a breakthrough and more like an early-adopter stress test for Intel NPU support.

  • `llama.cpp` now ships an official OpenVINO Docker target, so the image itself is a maintained path rather than a community hack.
  • The backend docs validate Qwen3-8B, not Qwen3-4B, which suggests this exact model/config is still outside the smooth path.
  • NPU mode is picky: it relies on static compilation and special prefill chunking, lacks model caching, and `llama-server` carries extra restrictions on that device.
  • OpenVINO's GGUF support has also been described as preview/WIP in recent docs, so the old OpenVINO-native pipeline and the new llama.cpp path are not apples-to-apples.
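For anyone trying to reproduce the numbers, a minimal sketch of serving the published image is below. The image name and model come from the post; the NPU device node path, volume layout, and server flags are assumptions about a typical setup, not a verified recipe:

```shell
# Sketch: serve a GGUF model through the official OpenVINO image.
# /dev/accel/accel0 (the Linux NPU device node) and the model path
# are assumptions -- adjust for your host.
docker run --rm -p 8080:8080 \
  --device /dev/accel/accel0 \
  -v "$PWD/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-openvino \
  -m /models/Qwen3-4B-Q4_0.gguf --host 0.0.0.0 --port 8080
```

Comparing this against the same model on the CPU or iGPU path of the same image would show whether the slowdown is specific to NPU dispatch or to the OpenVINO backend as a whole.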
// TAGS
llama-cpp · llm · inference · open-source · self-hosted · benchmark · edge-ai

DISCOVERED

19d ago (2026-03-24)

PUBLISHED

19d ago (2026-03-24)

RELEVANCE

7 / 10

AUTHOR

inphaser