llama.cpp Metal bug stalls long contexts

// 45d ago · INFRASTRUCTURE

A LocalLLaMA user reports that llama-server repeatedly hits Apple's Metal error kIOGPUCommandBufferCallbackErrorImpactingInteractivity while running Qwen3.6-35B-A3B on an M2 MacBook Pro with a 131K-token context. The failure leaves ggml's Metal backend in an unrecoverable error state, forcing server restarts that do not address the underlying failure.

// ANALYSIS

This is less a clean product update than a useful stress signal for local AI infrastructure: long-context, reasoning-heavy workloads are still finding hard edges in consumer Apple GPU scheduling.

  • The log points at llama.cpp's Metal backend during near-full-context prompt processing and context checkpoint churn, not an app-layer opencode issue.
  • A 35B-class quantized MoE model plus roughly 122K prompt tokens on 32GB unified memory is an aggressive setup, so memory pressure and GPU interactivity limits are plausible failure factors.
  • The practical mitigation path is likely smaller context, lower batch/ubatch settings, fewer GPU layers, a newer llama.cpp build, or filing a minimal reproducible issue upstream.
  • For developers betting on local coding agents, this is a reminder that "fits in RAM" is not the same as stable under long-running agentic workloads.
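
The mitigation path in the third bullet can be sketched as a small "fallback ladder": start from the failing settings, first halve the batch and micro-batch sizes, then halve the context. This is a minimal illustration, not something taken from the report. The flags used (`-m`, `-c`, `-b`, `-ub`, `-ngl`) are llama-server's standard options for model path, context size, batch size, micro-batch size, and GPU layer count. The model filename, the starting values, and the floors (`min_ctx`, `min_ubatch`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Iterator, List


@dataclass(frozen=True)
class ServerSettings:
    model_path: str      # hypothetical GGUF path, not from the report
    ctx_size: int        # passed as -c
    batch_size: int      # passed as -b
    ubatch_size: int     # passed as -ub
    n_gpu_layers: int    # passed as -ngl


def to_args(s: ServerSettings) -> List[str]:
    """Render settings as a llama-server argument list."""
    return [
        "llama-server",
        "-m", s.model_path,
        "-c", str(s.ctx_size),
        "-b", str(s.batch_size),
        "-ub", str(s.ubatch_size),
        "-ngl", str(s.n_gpu_layers),
    ]


def fallback_ladder(
    start: ServerSettings,
    min_ctx: int = 8192,      # assumed floor, pick what your workload tolerates
    min_ubatch: int = 64,     # assumed floor
) -> Iterator[ServerSettings]:
    """Yield progressively more conservative settings to try in order."""
    current = start
    yield current
    # Step 1: shrink the per-submission GPU work by halving batch/ubatch.
    while current.ubatch_size > min_ubatch:
        ub = max(min_ubatch, current.ubatch_size // 2)
        current = replace(
            current,
            ubatch_size=ub,
            batch_size=max(ub, current.batch_size // 2),
        )
        yield current
    # Step 2: if that is not enough, halve the context window.
    while current.ctx_size > min_ctx:
        current = replace(current, ctx_size=max(min_ctx, current.ctx_size // 2))
        yield current


if __name__ == "__main__":
    # Illustrative starting point; only the 131072 context comes from the report.
    start = ServerSettings("model.gguf", 131072, 2048, 512, 99)
    for settings in fallback_ladder(start):
        print(" ".join(to_args(settings)))
```

Each printed line is a candidate command to try by hand, stopping at the first one that survives a long prompt. Reducing `-ngl` could be added as a further rung in the same way. Batch sizes come first here on the assumption that smaller GPU submissions are the cheapest change to test, while shrinking the context changes what the agent can actually do.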

// TAGS

llama-cpp, inference, gpu, edge-ai, self-hosted, llm

DISCOVERED

45d ago

2026-04-21

PUBLISHED

45d ago

2026-04-21

RELEVANCE

7/10

AUTHOR

boutell