llama.cpp Metal bug stalls long contexts
OPEN_SOURCE ↗
REDDIT // 4h ago · INFRASTRUCTURE


A LocalLLaMA user reports that llama-server repeatedly hits Apple's Metal kIOGPUCommandBufferCallbackErrorImpactingInteractivity error while running Qwen3.6-35B-A3B on an M2 MacBook Pro with a 131K-token context. The failure leaves ggml's Metal backend in an unrecoverable error state, so restarting llama-server brings the process back up without making the underlying workload viable.
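For orientation, the reported setup corresponds roughly to an invocation like the following. The model filename and flag values are assumptions for illustration; the post does not include the exact command.

```shell
# Hypothetical llama-server launch approximating the reported workload.
# -c 131072 requests the full 131K-token context; -ngl 99 offloads all
# layers to the Metal GPU. Model path is a placeholder, not from the post.
llama-server \
  -m ./qwen3-35b-a3b-q4_k_m.gguf \
  -c 131072 \
  -ngl 99
```

At that context length on 32GB of unified memory, the KV cache alone competes with model weights for the same pool the GPU is rendering from, which is where Metal's interactivity watchdog comes into play.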

// ANALYSIS

This is less a clean product update than a useful stress signal for local AI infrastructure: long-context, reasoning-heavy workloads are still finding hard edges in consumer Apple GPU scheduling.

  • The log points at llama.cpp's Metal backend during near-full-context prompt processing and context checkpoint churn, not an app-layer opencode issue.
  • A 35B-class quantized MoE model plus roughly 122K prompt tokens on 32GB unified memory is an aggressive setup, so memory pressure and GPU interactivity limits are plausible failure factors.
  • The practical mitigation path is likely smaller context, lower batch/ubatch settings, fewer GPU layers, a newer llama.cpp build, or filing a minimal reproducible issue upstream.
  • For developers betting on local coding agents, this is a reminder that "fits in RAM" is not the same as stable under long-running agentic workloads.
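The mitigation path in the bullets above can be sketched as a single dialed-back invocation. All values here are illustrative starting points, not a verified fix for this report, and the model path is a placeholder:

```shell
# Hypothetical llama-server invocation reducing the likely pressure points:
# context size, batch/micro-batch, and GPU layer offload. Tune per machine.
llama-server \
  -m ./qwen3-35b-a3b-q4_k_m.gguf \
  -c 32768 \
  -b 1024 \
  -ub 256 \
  -ngl 30
```

Shrinking `-c` cuts KV-cache memory directly, smaller `-b`/`-ub` values shorten each Metal command buffer during prompt processing, and a lower `-ngl` keeps some layers on the CPU to relieve unified-memory pressure; if the error persists on a current build, a minimal reproduction filed upstream is the remaining step.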
// TAGS
llama-cpp · inference · gpu · edge-ai · self-hosted · llm

DISCOVERED

4h ago

2026-04-21

PUBLISHED

7h ago

2026-04-21

RELEVANCE

7 / 10

AUTHOR

boutell