MTP speed falls off past 85K context
A llama.cpp user ran MTP with Qwen3.6-27B Q4_K_M and charted a full coding session to see what the metrics look like in practice. The standout finding is that generation speed drops hard after roughly 85K context, while cold prefills remain expensive and slot-save still meaningfully improves KV-cache hit rate.
This is a useful reality check for long-context local inference: the feature works, but the tail latency and throughput curve still bend sharply once the session gets really long.
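A minimal back-of-envelope sketch of why cold prefill dominates new-session latency, and why cache reuse matters so much. The throughput numbers below are illustrative assumptions for a quantized ~27B model on consumer hardware, not measurements from the post:

```python
# Hypothetical rates -- assumptions for illustration, not the post's data.
PREFILL_TOK_PER_S = 900.0   # assumed cold prompt-processing speed
DECODE_TOK_PER_S = 25.0     # assumed generation speed at long context

def request_latency(prompt_tokens: int, cached_tokens: int, out_tokens: int) -> float:
    """Seconds for one request: only uncached prompt tokens pay prefill cost."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / PREFILL_TOK_PER_S + out_tokens / DECODE_TOK_PER_S

# Cold session: the full 85K-token prompt must be prefilled from scratch.
cold = request_latency(85_000, 0, 200)
# Warm session: a restored KV cache covers most of the prompt.
warm = request_latency(85_000, 80_000, 200)
print(f"cold: {cold:.1f}s  warm: {warm:.1f}s")
```

Under these assumed rates the cold request spends well over a minute in prefill alone, while the warm one finishes in seconds, which is the gap slot-save is buying back.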
- Performance degradation past 85K context suggests the practical ceiling for "daily driver" coding sessions is lower than the raw context window implies
- Cold prefill cost is still the main tax for new sessions, so reuse and cache persistence matter a lot more than marketing benchmarks
- KV cache slot-save looks like the unsung hero here; improving hit rate is probably more valuable than chasing small decode gains
- Qwen3.6-27B Q4_K_M remains viable for local coding, but this session shows why observability matters more than vibes when you push long contexts
- The post is more of an engineering benchmark note than a launch: it helps separate "usable in practice" from "works on paper"
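One way to quantify the slot-save effect the post charts is a simple session-level cache hit rate: the fraction of all prompt tokens served from the KV cache rather than re-prefilled. The field names and session numbers here are illustrative (llama.cpp's server reports comparable per-request counts in its timings output, but this is a standalone sketch):

```python
def kv_cache_hit_rate(requests: list[tuple[int, int]]) -> float:
    """requests: (prompt_tokens, cached_tokens) per call.
    Returns the fraction of all prompt tokens served from the KV cache."""
    total = sum(p for p, _ in requests)
    hits = sum(min(c, p) for p, c in requests)  # cached can't exceed prompt
    return hits / total if total else 0.0

# Hypothetical coding session: each turn's prompt grows, and slot-save
# lets the server reuse the previous turn's tokens.
session = [(8_000, 0), (12_000, 8_000), (40_000, 12_000), (85_000, 40_000)]
print(f"session hit rate: {kv_cache_hit_rate(session):.1%}")
```

A metric like this makes the "unsung hero" claim checkable: a few points of hit rate on 85K-token prompts save far more wall-clock time than a comparable percentage gain in decode speed.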
DISCOVERED: 1h ago (2026-05-07)
PUBLISHED: 3h ago (2026-05-07)
AUTHOR: admajic