Llama.cpp fixes critical MTP server VRAM leak
Llama.cpp release b9274 resolves a severe memory leak in the server component that affected users of Multi-Token Prediction (MTP) models. The fix ensures that speculative decoders and draft contexts are properly destroyed during sleep/resume cycles, preventing progressive GPU memory exhaustion.
This is a critical stability patch for anyone running speculative decoding in production via the llama.cpp server.
- Prior to this fix, the server allocated a new draft context on each resume without freeing the old one during idle sleep, so VRAM usage grew with every sleep/resume cycle until the process crashed with an out-of-memory (OOM) error.
- The patch ensures that `ctx_dft` and `model_dft` are explicitly freed in the `destroy()` function.
- The bug highlights the ongoing challenges of state management in complex local LLM inference setups, particularly when speculative decoding is combined with pausing idle resources.
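
The leak pattern described in the bullets above can be illustrated with a minimal, self-contained sketch. This is not the llama.cpp server code: `DraftModel`, `DraftContext`, `ServerState`, and `live_contexts_after_cycles` are hypothetical stand-ins that only count live objects, and only the field names `model_dft` and `ctx_dft` and the `destroy()` name are taken from the summary. The `free_draft` flag is an assumption added here to contrast the pre-fix and post-fix behaviour.

```cpp
// Hypothetical counters standing in for GPU allocations.
static int g_live_models   = 0;
static int g_live_contexts = 0;

// Stand-in types: construction "allocates", destruction "frees".
struct DraftModel {
    DraftModel()  { ++g_live_models; }
    ~DraftModel() { --g_live_models; }
};
struct DraftContext {
    DraftContext()  { ++g_live_contexts; }
    ~DraftContext() { --g_live_contexts; }
};

struct ServerState {
    DraftModel*   model_dft = nullptr;
    DraftContext* ctx_dft   = nullptr;

    // Called at startup and again on every resume.
    void load() {
        model_dft = new DraftModel();
        ctx_dft   = new DraftContext();
    }

    // Called when the server goes to sleep.
    // free_draft == false models the leaky behaviour: the pointers are
    // dropped but the objects they pointed to are never released.
    void destroy(bool free_draft) {
        if (free_draft) {
            delete ctx_dft;    // assumed order: context before its model
            delete model_dft;
        }
        ctx_dft   = nullptr;
        model_dft = nullptr;
    }
};

// Runs n sleep/resume cycles and reports how many draft contexts are alive.
int live_contexts_after_cycles(int n, bool free_draft) {
    g_live_models   = 0;
    g_live_contexts = 0;

    ServerState s;
    s.load();
    for (int i = 0; i < n; ++i) {
        s.destroy(free_draft);  // idle sleep
        s.load();               // resume
    }
    int live = g_live_contexts;
    s.destroy(true);            // release the current pair before returning
    return live;
}
```

In this sketch, `live_contexts_after_cycles(5, false)` reports six live contexts (one per load), while `live_contexts_after_cycles(5, true)` reports one, which is the steady state the fix is described as restoring.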
DISCOVERED
2026-05-22
PUBLISHED
2026-05-21
AUTHOR
Bulky-Priority6824