Qwen3.5 prefix caching slashes TTFT 22s to 2s

// 45d ago · BENCHMARK RESULT

A Mac Studio user running Qwen3.5-397B-A17B locally cut warm-request time to first token (TTFT) from 22 seconds to 2 seconds by using prefix caching and chunked prefill. The post also documents a vMLX MLLM crash on long multimodal prefills and the hybrid SSM cache limits that make MLX serving stacks fragile for some models.
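
To make the two techniques concrete, here is a minimal sketch of how prefix caching and chunked prefill can combine on a warm request. It is a toy model, not the code from the post or from any MLX serving stack: `ToyPrefixCache`, `prefill_chunks`, `plan_request`, the 12,000-token prefix, the 200-token user turn, and the 2,048-token step size are all hypothetical names and numbers chosen for illustration.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Hypothetical prefix cache: maps a token prefix to an opaque cached state."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], str] = {}

    def store(self, tokens: List[int]) -> None:
        # A real server would keep the model's cache tensors here.
        self._entries[tuple(tokens)] = f"state@{len(tokens)}"

    def longest_match(self, tokens: List[int]) -> int:
        """Return the length of the longest stored prefix of `tokens`."""
        best = 0
        for prefix in self._entries:
            n = len(prefix)
            if best < n <= len(tokens) and tuple(tokens[:n]) == prefix:
                best = n
        return best


def prefill_chunks(total: int, cached: int, step: int) -> List[Tuple[int, int]]:
    """Split the uncached range [cached, total) into chunks of at most `step` tokens."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(start, min(start + step, total)) for start in range(cached, total, step)]


def plan_request(cache: ToyPrefixCache, tokens: List[int], step: int):
    """Return (tokens served from cache, prefill chunks still to compute)."""
    cached = cache.longest_match(tokens)
    return cached, prefill_chunks(len(tokens), cached, step)


if __name__ == "__main__":
    shared_prefix = list(range(12_000))            # hypothetical agent prefix
    prompt = shared_prefix + [7] * 200             # hypothetical user turn
    cache = ToyPrefixCache()

    print("cold:", plan_request(cache, prompt, step=2048))
    cache.store(shared_prefix)                     # keep the prefix state for reuse
    print("warm:", plan_request(cache, prompt, step=2048))
```

In this toy setup the cold request has to prefill the whole prompt in several chunks, while the warm request only prefills the tokens after the shared prefix. Chunking bounds how many tokens are processed per step; caching removes most of the work altogether.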

// ANALYSIS

The big story here is not just faster inference; it is that serving-layer details now decide whether a 397B local model feels usable or broken.

  • Prefix caching is doing the heavy lifting: the repeated 12K-token agent prefix becomes effectively free on warm requests, which matters far more than marginal tokens/s gains.
  • The vMLX failure mode is a deployment trap: `--prefill-step-size` works in `SimpleEngine`, but Qwen3.5 auto-routes to the MLLM batched path, where it apparently stops helping and can OOM on real long-context workloads.
  • The SSM warning is legitimate: hybrid attention + recurrent state is not a standard KV-cache problem, so MLX stacks that assume trimmable cache semantics can misbehave on Qwen3.5-like models.
  • For a single-user local assistant, dropping continuous batching and paged KV is a rational tradeoff if it buys stability plus working prefix reuse.
  • The measurement is useful because it separates warm-cache latency from cold-cache latency instead of hand-waving about model size or quantization.
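
The point about trimmable cache semantics can be illustrated with a toy contrast between a per-token KV cache and a recurrent state. This is a simplified sketch under stated assumptions, not how Qwen3.5, MLX, or vMLX implement their caches: `ToyKVCache`, `ToyRecurrentState`, and the integer "state" update are hypothetical stand-ins for real tensors.

```python
from typing import Dict, List


class ToyKVCache:
    """Per-token entries: any shorter prefix is recoverable by truncation."""

    def __init__(self) -> None:
        self.entries: List[int] = []

    def append(self, token: int) -> None:
        self.entries.append(token)

    def trim(self, length: int) -> None:
        if not 0 <= length <= len(self.entries):
            raise ValueError("length out of range")
        del self.entries[length:]


class ToyRecurrentState:
    """One running summary: past tokens are folded in and cannot be peeled off."""

    def __init__(self) -> None:
        self.value = 0
        self.length = 0
        self.snapshots: Dict[int, int] = {}

    def append(self, token: int) -> None:
        # Stand-in for a recurrent update; the old state is overwritten.
        self.value = (self.value * 31 + token) % 1_000_003
        self.length += 1

    def snapshot(self) -> None:
        """Save a checkpoint at the current length (hypothetical helper)."""
        self.snapshots[self.length] = self.value

    def restore(self, length: int) -> bool:
        """Roll back only if a checkpoint exists at exactly `length`."""
        if length not in self.snapshots:
            return False
        self.value = self.snapshots[length]
        self.length = length
        return True


if __name__ == "__main__":
    kv, state = ToyKVCache(), ToyRecurrentState()
    for i, tok in enumerate([5, 9, 2, 7, 1, 8, 3, 6, 4, 0], start=1):
        kv.append(tok)
        state.append(tok)
        if i == 6:
            state.snapshot()          # checkpoint at a chosen prefix boundary

    kv.trim(4)                        # works for any length
    print("KV length after trim:", len(kv.entries))
    print("restore to 4:", state.restore(4))   # no checkpoint at 4
    print("restore to 6:", state.restore(6))   # checkpoint exists at 6
```

The design consequence in this sketch is that prefix reuse for the recurrent part depends on a checkpoint having been saved at exactly the reuse boundary, so a cache manager that assumes it can trim to an arbitrary length would not carry over unchanged.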
// TAGS
qwen3.5, llm, inference, prefix-caching, self-hosted, multimodal, benchmark

DISCOVERED

45d ago

2026-04-18

PUBLISHED

45d ago

2026-04-18

RELEVANCE

8/10

AUTHOR

trevorbg