OPEN_SOURCE
REDDIT // 19d ago · BENCHMARK RESULT
llama.cpp OpenVINO backend crawls on Intel NPU
One r/LocalLLaMA user says the `ghcr.io/ggml-org/llama.cpp:server-openvino` Docker image only manages about 0.1 tokens/sec on a Core Ultra 2 255U NPU with Qwen3-4B Q4_0, far below their earlier OpenVINO-native container experiments. They’re asking whether anyone else has seen better throughput on the same backend.
// ANALYSIS
Hot take: this looks less like a breakthrough and more like an early-adopter stress test for Intel NPU support.
- `llama.cpp` now ships an official OpenVINO Docker target, so the image itself is a maintained path rather than a community hack.
- The backend docs validate Qwen3-8B, not Qwen3-4B, which suggests this exact model/config is still outside the smooth path.
- NPU mode is picky: it uses static compilation, special prefill chunking, and no model caching, and `llama-server` has extra limits there.
- OpenVINO's GGUF support has also been described as preview/WIP in recent docs, so the old OpenVINO-native pipeline and the new llama.cpp path are not apples-to-apples.
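For anyone wanting to compare numbers on their own hardware, a minimal reproduction sketch follows. Only the image name comes from the post; the model path, port, and device passthrough are assumptions that will vary by setup:

```shell
# Pull the official OpenVINO-enabled llama.cpp server image named in the post
docker pull ghcr.io/ggml-org/llama.cpp:server-openvino

# Serve a local GGUF model; the model filename and mount path are placeholders.
# NPU access from inside a container typically requires passing the accelerator
# device through to the container, which differs per distro and driver stack.
docker run --rm -p 8080:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-openvino \
  -m /models/Qwen3-4B-Q4_0.gguf --port 8080
```

Once the server is up, throughput can be read from the timing fields `llama-server` reports per request, which makes the 0.1 tokens/sec figure straightforward to check against CPU or GPU runs of the same image.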
// TAGS
llama-cpp · llm · inference · open-source · self-hosted · benchmark · edge-ai
DISCOVERED
2026-03-24
PUBLISHED
2026-03-24
RELEVANCE
7/10
AUTHOR
inphaser