MTP speed falls off past 85K context
A llama.cpp user ran MTP with Qwen3.6-27B Q4_K_M and charted a full coding session to see what the metrics look like in practice. The standout finding is that generation speed drops hard after roughly 85K context, while cold prefills remain expensive and slot-save still meaningfully improves KV-cache hit rate.
This is a useful reality check for long-context local inference: the feature works, but the tail latency and throughput curve still bend sharply once the session gets really long.
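A minimal back-of-envelope sketch of why cold prefill dominates new-session latency, and why cache reuse matters so much. The throughput numbers below are illustrative assumptions for a quantized ~27B model on consumer hardware, not measurements from the post:

```python
# Hypothetical rates -- assumptions for illustration, not the post's data.
PREFILL_TOK_PER_S = 900.0   # assumed cold prompt-processing speed
DECODE_TOK_PER_S = 25.0     # assumed generation speed at long context

def request_latency(prompt_tokens: int, cached_tokens: int, out_tokens: int) -> float:
    """Seconds for one request: only uncached prompt tokens pay prefill cost."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / PREFILL_TOK_PER_S + out_tokens / DECODE_TOK_PER_S

# Cold session: the full 85K-token prompt must be prefilled from scratch.
cold = request_latency(85_000, 0, 200)
# Warm session: a restored KV cache covers most of the prompt.
warm = request_latency(85_000, 80_000, 200)
print(f"cold: {cold:.1f}s  warm: {warm:.1f}s")
```

Under these assumed rates the cold request spends well over a minute in prefill alone, while the warm one finishes in seconds, which is the gap slot-save is buying back.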
- Performance degradation past 85K context suggests the practical ceiling for "daily driver" coding sessions is lower than the raw context window implies
- Cold prefill cost is still the main tax for new sessions, so reuse and cache persistence matter a lot more than marketing benchmarks
- KV cache slot-save looks like the unsung hero here; improving hit rate is probably more valuable than chasing small decode gains
- Qwen3.6-27B Q4_K_M remains viable for local coding, but this session shows why observability matters more than vibes when you push long contexts
- The post is more of an engineering benchmark note than a launch: it helps separate "usable in practice" from "works on paper"
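One way to quantify the slot-save effect the post charts is a simple session-level cache hit rate: the fraction of all prompt tokens served from the KV cache rather than re-prefilled. The field names and session numbers here are illustrative (llama.cpp's server reports comparable per-request counts in its timings output, but this is a standalone sketch):

```python
def kv_cache_hit_rate(requests: list[tuple[int, int]]) -> float:
    """requests: (prompt_tokens, cached_tokens) per call.
    Returns the fraction of all prompt tokens served from the KV cache."""
    total = sum(p for p, _ in requests)
    hits = sum(min(c, p) for p, c in requests)  # cached can't exceed prompt
    return hits / total if total else 0.0

# Hypothetical coding session: each turn's prompt grows, and slot-save
# lets the server reuse the previous turn's tokens.
session = [(8_000, 0), (12_000, 8_000), (40_000, 12_000), (85_000, 40_000)]
print(f"session hit rate: {kv_cache_hit_rate(session):.1%}")
```

A metric like this makes the "unsung hero" claim checkable: a few points of hit rate on 85K-token prompts save far more wall-clock time than a comparable percentage gain in decode speed.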
DISCOVERED: 1h ago (2026-05-07)
PUBLISHED: 3h ago (2026-05-07)
AUTHOR: admajic