llama.cpp MTP slashes Qwen 27B context on 3090
A LocalLLaMA user reports that enabling Multi-Token Prediction (MTP) for Qwen 27B in llama.cpp cuts the usable context from 137k to 14k tokens on a 24GB RTX 3090. The roughly tenfold reduction points to the VRAM overhead of the draft states used for local speculative decoding.
Speculative decoding speeds up inference but taxes memory heavily.
- MTP drafts require parallel KV cache states, eating into the VRAM otherwise used for main context
- On a 24GB card, running a 27B parameter model at Q4 leaves little room for massive contexts once draft states are enabled
- Users must choose between faster token generation via MTP and long-context capabilities on constrained hardware
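The trade-off in the points above comes down to budget arithmetic: whatever VRAM the draft states claim is no longer available for the main KV cache. The sketch below is a rough illustration only. The layer count, KV head count, head dimension, weight size, and overhead figures are hypothetical placeholders, not Qwen 27B's real architecture and not the reported measurements, and the source does not confirm how llama.cpp actually allocates memory for MTP.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (keys plus values, all layers)."""
    # Each layer stores one key and one value vector per KV head per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(vram_bytes, weight_bytes, overhead_bytes, per_token_bytes):
    """Tokens that fit in whatever VRAM is left after weights and overhead."""
    free = vram_bytes - weight_bytes - overhead_bytes
    return max(free, 0) // per_token_bytes


# Hypothetical model shape with an fp16 cache (assumption, not Qwen 27B's real config).
per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)

# Hypothetical budgets: 16 GiB of quantized weights on a 24 GiB card.
# "overhead_bytes" stands in for compute buffers and, in the second case,
# extra draft-state memory; both values are made up for illustration.
baseline = max_context_tokens(24 * GIB, 16 * GIB, 1 * GIB, per_token)
with_draft = max_context_tokens(24 * GIB, 16 * GIB, 5 * GIB, per_token)

print(f"KV bytes per token:        {per_token}")
print(f"context without draft:     {baseline} tokens")
print(f"context with draft states: {with_draft} tokens")
```

Because the free budget after loading weights is small on a 24GB card, every extra gigabyte of fixed overhead removes a disproportionate share of the context window. Quantizing the KV cache lowers `bytes_per_elem` and stretches the same budget further.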
DISCOVERED: 2026-05-27
PUBLISHED: 2026-05-27
AUTHOR: regunakyle