llama.cpp Metal bug stalls long contexts
OPEN_SOURCE ↗
REDDIT // 4h ago · INFRASTRUCTURE


A LocalLLaMA user reports that llama-server repeatedly hits Apple's Metal kIOGPUCommandBufferCallbackErrorImpactingInteractivity error while running Qwen3.6-35B-A3B on an M2 MacBook Pro with a 131K-token context. The failure leaves ggml's Metal backend in an unrecoverable error state, so restarting llama-server brings the process back up without making the underlying workload viable.
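For orientation, the reported setup corresponds roughly to an invocation like the following. The model filename and flag values are assumptions for illustration; the post does not include the exact command.

```shell
# Hypothetical llama-server launch approximating the reported workload.
# -c 131072 requests the full 131K-token context; -ngl 99 offloads all
# layers to the Metal GPU. Model path is a placeholder, not from the post.
llama-server \
  -m ./qwen3-35b-a3b-q4_k_m.gguf \
  -c 131072 \
  -ngl 99
```

At that context length on 32GB of unified memory, the KV cache alone competes with model weights for the same pool the GPU is rendering from, which is where Metal's interactivity watchdog comes into play.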

// ANALYSIS

This is less a clean product update than a useful stress signal for local AI infrastructure: long-context, reasoning-heavy workloads are still finding hard edges in consumer Apple GPU scheduling.

  • The log points at llama.cpp's Metal backend during near-full-context prompt processing and context checkpoint churn, not an app-layer opencode issue.
  • A 35B-class quantized MoE model plus roughly 122K prompt tokens on 32GB unified memory is an aggressive setup, so memory pressure and GPU interactivity limits are plausible failure factors.
  • The practical mitigation path is likely smaller context, lower batch/ubatch settings, fewer GPU layers, a newer llama.cpp build, or filing a minimal reproducible issue upstream.
  • For developers betting on local coding agents, this is a reminder that "fits in RAM" is not the same as stable under long-running agentic workloads.
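The mitigation path in the bullets above can be sketched as a single dialed-back invocation. All values here are illustrative starting points, not a verified fix for this report, and the model path is a placeholder:

```shell
# Hypothetical llama-server invocation reducing the likely pressure points:
# context size, batch/micro-batch, and GPU layer offload. Tune per machine.
llama-server \
  -m ./qwen3-35b-a3b-q4_k_m.gguf \
  -c 32768 \
  -b 1024 \
  -ub 256 \
  -ngl 30
```

Shrinking `-c` cuts KV-cache memory directly, smaller `-b`/`-ub` values shorten each Metal command buffer during prompt processing, and a lower `-ngl` keeps some layers on the CPU to relieve unified-memory pressure; if the error persists on a current build, a minimal reproduction filed upstream is the remaining step.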
// TAGS
llama-cpp · inference · gpu · edge-ai · self-hosted · llm

DISCOVERED

4h ago

2026-04-21

PUBLISHED

7h ago

2026-04-21

RELEVANCE

7 / 10

AUTHOR

boutell