OPEN_SOURCE
REDDIT // 19d ago · BENCHMARK RESULT
llama.cpp OpenVINO backend crawls on Intel NPU
One r/LocalLLaMA user says the `ghcr.io/ggml-org/llama.cpp:server-openvino` Docker image only manages about 0.1 tokens/sec on a Core Ultra 2 255U NPU with Qwen3-4B Q4_0, far below their earlier OpenVINO-native container experiments. They’re asking whether anyone else has seen better throughput on the same backend.
// ANALYSIS
Hot take: this looks less like a breakthrough and more like an early-adopter stress test for Intel NPU support.
- `llama.cpp` now ships an official OpenVINO Docker target, so the image itself is a maintained path rather than a community hack.
- The backend docs validate Qwen3-8B, not Qwen3-4B, which suggests this exact model/config is still outside the smooth path.
- NPU mode is picky: it uses static compilation, special prefill chunking, and no model caching, and `llama-server` has extra limits there.
- OpenVINO's GGUF support has also been described as preview/WIP in recent docs, so the old OpenVINO-native pipeline and the new llama.cpp path are not apples-to-apples.
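For anyone wanting to compare numbers on their own hardware, a minimal reproduction sketch follows. Only the image name comes from the post; the model path, port, and device passthrough are assumptions that will vary by setup:

```shell
# Pull the official OpenVINO-enabled llama.cpp server image named in the post
docker pull ghcr.io/ggml-org/llama.cpp:server-openvino

# Serve a local GGUF model; the model filename and mount path are placeholders.
# NPU access from inside a container typically requires passing the accelerator
# device through to the container, which differs per distro and driver stack.
docker run --rm -p 8080:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-openvino \
  -m /models/Qwen3-4B-Q4_0.gguf --port 8080
```

Once the server is up, throughput can be read from the timing fields `llama-server` reports per request, which makes the 0.1 tokens/sec figure straightforward to check against CPU or GPU runs of the same image.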
// TAGS
llama-cpp · llm · inference · open-source · self-hosted · benchmark · edge-ai
DISCOVERED
2026-03-24
PUBLISHED
2026-03-24
RELEVANCE
7/10
AUTHOR
inphaser