llama.cpp SYCL fix stops dual Arc RAM bloat
OPEN_SOURCE
REDDIT · 4d ago · NEWS


Dual Intel Arc GPUs can drive llama.cpp's SYCL backend into mirroring device allocations in system RAM, which makes hosts OOM long before VRAM is full. The reported fix swaps `sycl::malloc_device()` for Level Zero `zeMemAllocDevice()` with fallback behavior, keeping host memory flat without hurting inference speed.
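A minimal sketch of the reported allocator swap. The function name `alloc_device_mem` and the exact fallback logic are illustrative assumptions, not the actual llama.cpp patch; the sketch assumes a SYCL queue backed by the Level Zero backend (oneAPI's `ext_oneapi_level_zero` interop).

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/backend/level_zero.hpp>
#include <level_zero/ze_api.h>

// Hypothetical helper: allocate device memory via Level Zero directly,
// falling back to the plain SYCL allocator if that fails.
void *alloc_device_mem(sycl::queue &q, size_t size) {
    // Pull the native Level Zero handles out of the SYCL context/device.
    auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
    auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());

    ze_device_mem_alloc_desc_t desc{};
    desc.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;

    void *ptr = nullptr;
    // zeMemAllocDevice places the buffer in VRAM without the host-side
    // mirror the sycl::malloc_device() path reportedly triggers on xe.
    if (zeMemAllocDevice(ze_ctx, &desc, size, /*alignment=*/0, ze_dev, &ptr)
            == ZE_RESULT_SUCCESS)
        return ptr;

    // Fallback: keep the original SYCL allocator so other backends
    // (or older drivers) keep working.
    return sycl::malloc_device(size, q);
}
```

Because SYCL USM pointers and Level Zero device allocations share the same context here, kernels can (per the post's claim) consume the returned pointer unchanged.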

// ANALYSIS

This looks like a driver-path trap, not a tuning problem: the model fit was never the issue, the allocation API was. If the fix upstreams cleanly, it removes a major blocker for practical multi-GPU Intel local inference.

  • `sycl::malloc_device()` appears to hit xe's DMA-buf/TTM path, so GPU allocations get mirrored into host RAM at allocation time
  • `zeMemAllocDevice()` avoids that mirror behavior, and the post claims SYCL kernels can still read the pointers without issues
  • The failure mode is nasty because kernel-side staging bypasses cgroup limits and looks like an application-level memory leak
  • The evidence points to a stability win more than a performance win: throughput stays flat while RAM usage drops from crash territory to roughly 10%
  • If reproducible broadly, this changes the hardware advice for dual Intel Arc setups from "buy more system RAM" to "use the right allocator"
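To see whether an allocation path is silently staging in host RAM, the symptom described above can be checked directly: sample the process's resident set size before and after each device allocation. A sketch of such a probe on Linux (reads `VmRSS` from `/proc/self/status`; the helper name is an assumption):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Return the process resident set size in kB, or -1 on failure.
// A dual-GPU mirroring bug shows up as VmRSS jumping by roughly the
// size of each "device" allocation, even though VRAM was the target.
long rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {           // line: "VmRSS:  1234 kB"
            std::istringstream iss(line.substr(6));
            long kb = -1;
            iss >> kb;
            return kb;
        }
    }
    return -1;
}
```

Logging `rss_kb()` around allocations distinguishes this driver-path mirroring from an ordinary application leak, since the growth correlates with device buffers rather than host ones.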
// TAGS
llama-cpp · inference · gpu · open-source · llm

DISCOVERED

2026-04-08 (4d ago)

PUBLISHED

2026-04-08 (4d ago)

RELEVANCE

8/10

AUTHOR

Katostrofik