Intel Core Ultra NPUs run LLMs via OpenVINO
Intel's NPUs are graduating from specialized image tasks to general-purpose LLM inference. New benchmarks show the NPU achieving 6-12 tokens/sec on 8B models, offering a power-efficient alternative to GPUs for local AI.
The shift toward NPU-based LLM inference is the defining feature of the "AI PC" era, prioritizing battery life over raw speed.
- OpenVINO 2026.0 introduces speculative decoding on NPUs, significantly narrowing the latency gap with iGPUs.
- An iGPU can reach 20 tokens/sec, but the NPU delivers 10 tokens/sec while drawing 3x-5x less power, which suits persistent background assistants.
- Real-world testing confirms that secure, GDPR-compliant local processing is viable; one example is batch processing 15,000 images in 7 hours on mobile hardware.
- Expanded model support for Qwen2.5 and MiniCPM indicates a rapidly maturing software ecosystem for local edge inference.
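
The speed and power figures in the list above can be combined into an energy-per-token comparison, and the image batch figure into a per-image time. The sketch below is only arithmetic on the numbers quoted in this summary. It assumes that "3x-5x less power" means the NPU's average power draw is the iGPU's divided by 3 to 5, and that both throughput figures refer to the same model; the function names are illustrative and do not come from OpenVINO.

```python
def energy_per_token_ratio(npu_tps: float, igpu_tps: float, power_reduction: float) -> float:
    """Return NPU energy per token divided by iGPU energy per token.

    Energy per token = average power / tokens per second.
    iGPU power is normalized to 1.0; NPU power is assumed to be 1.0 / power_reduction.
    """
    igpu_energy = 1.0 / igpu_tps
    npu_energy = (1.0 / power_reduction) / npu_tps
    return npu_energy / igpu_energy


def seconds_per_item(items: int, hours: float) -> float:
    """Average wall-clock seconds spent per item in a batch job."""
    return hours * 3600 / items


if __name__ == "__main__":
    # Figures quoted above: NPU 10 tokens/sec, iGPU 20 tokens/sec, 3x-5x less power.
    for reduction in (3, 5):
        ratio = energy_per_token_ratio(10, 20, reduction)
        print(f"{reduction}x less power: NPU uses {ratio:.2f}x the iGPU's energy per token")

    # Figures quoted above: 15,000 images in 7 hours.
    print(f"{seconds_per_item(15_000, 7):.2f} seconds per image")
```

Under these assumptions, running at half the speed halves the advantage: a 3x-5x reduction in power draw works out to roughly 1.5x-2.5x less energy per generated token, and the image batch averages about 1.7 seconds per image.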
DISCOVERED
2026-04-12
PUBLISHED
2026-04-12
AUTHOR
wossnameX