OPEN_SOURCE
REDDIT · 5d ago · PRODUCT UPDATE

Llama.cpp triples Q8_0 speed on Intel Arc

A performance fix for the llama.cpp SYCL backend delivers a 3.1x speedup for Q8_0 quantization on Intel Arc GPUs. By implementing a "reorder" optimization that separates scale factors from weight data, the update enables coalesced memory access and boosts bandwidth utilization from 21% to 66%.
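The "reorder" idea can be sketched in a few lines. The 34-byte block layout below (one fp16 scale plus 32 int8 weights, `QK8_0 = 32`) matches ggml's `block_q8_0` definition; the reorder function itself is an illustrative sketch of the transformation, not the actual SYCL backend code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified Q8_0 block: one fp16 scale followed by 32 int8 weights,
// giving the 34-byte stride discussed above. (QK8_0 = 32 in ggml.)
constexpr int QK8_0 = 32;

struct block_q8_0 {
    uint16_t d;          // fp16 scale, stored as raw bits here for simplicity
    int8_t   qs[QK8_0];  // quantized weights
};

// "Reorder": pack all weight bytes contiguously, then all scales,
// so neighbouring GPU threads read neighbouring addresses.
void reorder_q8_0(const block_q8_0 *src, size_t nblocks,
                  std::vector<int8_t> &qs_out, std::vector<uint16_t> &d_out) {
    qs_out.resize(nblocks * QK8_0);
    d_out.resize(nblocks);
    for (size_t b = 0; b < nblocks; ++b) {
        std::memcpy(&qs_out[b * QK8_0], src[b].qs, QK8_0);
        d_out[b] = src[b].d;
    }
}
```

After the split, the weight array has a clean power-of-two stride and the scales can be fetched separately once per block.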

// ANALYSIS

This optimization is a massive win for the Intel Xe2 ecosystem, finally unlocking the bandwidth potential of Battlemage GPUs for high-precision local LLM inference.

  • Q8_0 quantization previously suffered from non-coalesced memory access due to its non-power-of-two 34-byte block size (a 2-byte fp16 scale plus 32 int8 weights).
  • Reordering allows for efficient GPU cache usage, pushing performance on Qwen3.5-27B from 4.88 t/s to over 15 t/s.
  • The speedup makes Q8_0 quantization faster than the less-precise Q6_K on Intel hardware.
  • The fix also addresses a silent bug in SYCL buffer initialization that prevented the optimization from being enabled for Q8_0 tensors.
  • Verification included binary-patching Intel's closed-source IPEX-LLM to confirm the hardware's theoretical limits before implementing the open-source fix.
// TAGS
llama-cpp · llm · gpu · open-source · inference · sycl · intel-arc

DISCOVERED

5d ago (2026-04-06)

PUBLISHED

5d ago (2026-04-06)

RELEVANCE

8/10

AUTHOR

Katostrofik