OPEN_SOURCE ↗
REDDIT // 5d ago · PRODUCT UPDATE
Llama.cpp triples Q8_0 speed on Intel Arc
A performance fix for the llama.cpp SYCL backend delivers a 3.1x speedup for Q8_0 quantization on Intel Arc GPUs. By implementing a "reorder" optimization that separates scale factors from weight data, the update enables coalesced memory access and boosts bandwidth utilization from 21% to 66%.
// ANALYSIS
This optimization is a massive win for the Intel Xe2 ecosystem, finally unlocking the bandwidth potential of Battlemage GPUs for high-precision local LLM inference.
- Q8_0 quantization previously suffered from non-coalesced memory access due to its non-power-of-two 34-byte block size.
- Reordering allows for efficient GPU cache usage, pushing performance on Qwen3.5-27B from 4.88 t/s to over 15 t/s.
- The speedup makes Q8_0 quantization faster than the less-precise Q6_K on Intel hardware.
- The fix also addresses a silent bug in SYCL buffer initialization that prevented the optimization from being enabled for Q8_0 tensors.
- Verification included binary-patching Intel's closed-source IPEX-LLM to confirm the hardware's theoretical limits before implementing the open-source fix.
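The "reorder" idea above can be sketched as an array-of-structs to struct-of-arrays transform. This is a minimal, hypothetical illustration (the struct and function names are invented here, not llama.cpp's actual SYCL code): Q8_0 stores a 2-byte scale plus 32 int8 weights per block, so interleaved 34-byte blocks put scales between weight runs and break coalesced loads; packing all weights contiguously and moving the scales to a separate array restores stride-friendly access.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of a Q8_0 block: a 2-byte scale followed by 32 int8 weights.
// (llama.cpp packs this to 34 bytes, a non-power-of-two size; here the
// compiler may pad it, which does not matter for the illustration.)
constexpr int QK8_0 = 32;

struct BlockQ8_0 {               // interleaved (array-of-structs) layout
    uint16_t d;                  // per-block scale (fp16 bit pattern)
    int8_t   qs[QK8_0];          // quantized weights
};

// Hypothetical "reordered" layout: all weights block-after-block in one
// contiguous array, all scales packed separately, so consecutive GPU
// work-items read consecutive addresses.
struct ReorderedQ8_0 {
    std::vector<int8_t>   qs;    // weights, contiguous
    std::vector<uint16_t> d;     // scales, contiguous
};

ReorderedQ8_0 reorder(const std::vector<BlockQ8_0>& blocks) {
    ReorderedQ8_0 out;
    out.qs.reserve(blocks.size() * QK8_0);
    out.d.reserve(blocks.size());
    for (const auto& b : blocks) {
        out.qs.insert(out.qs.end(), b.qs, b.qs + QK8_0);
        out.d.push_back(b.d);
    }
    return out;
}
```

The transform is lossless: dequantization still multiplies weight `i` of block `k` by scale `k`, only the addresses change.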
// TAGS
llama-cpp · llm · gpu · open-source · inference · sycl · intel-arc
DISCOVERED
5d ago
2026-04-06
PUBLISHED
5d ago
2026-04-06
RELEVANCE
8/10
AUTHOR
Katostrofik