Qwen3.5 prefix caching slashes TTFT 22s to 2s
OPEN_SOURCE

REDDIT // 6h ago · BENCHMARK RESULT

A Mac Studio user running Qwen3.5-397B-A17B locally cut warm-request TTFT from 22 seconds to 2 seconds by using prefix caching and chunked prefill. The post also documents a vMLX MLLM crash on long multimodal prefills and the hybrid SSM cache limits that make MLX serving stacks fragile for some models.
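The chunked-prefill half of the fix can be sketched in a few lines. This is a toy illustration, not the MLX/vMLX implementation; `forward` and `prefill_step_size` are hypothetical stand-ins for the real engine internals:

```python
# Chunked prefill: run the prompt through the model in fixed-size chunks
# instead of one giant forward pass, so peak activation memory is bounded
# by the chunk size rather than the full prompt length.
prefill_step_size = 2048
prompt = list(range(12_000))  # stand-in for a 12K-token agent prefix

def forward(chunk, tokens_seen):
    # Stand-in for one model forward pass that extends the KV cache.
    return tokens_seen + len(chunk)

tokens_seen = 0
for start in range(0, len(prompt), prefill_step_size):
    chunk = prompt[start:start + prefill_step_size]
    tokens_seen = forward(chunk, tokens_seen)

print(tokens_seen)  # all 12000 prompt tokens prefilled, 2048 at a time
```

The tradeoff is slightly more launch overhead per chunk in exchange for a flat memory ceiling, which is what makes long prefills survivable on a fixed-RAM Mac Studio.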

// ANALYSIS

The big story here is not just faster inference; it's that serving-layer details now decide whether a 397B-parameter local model feels usable or broken.

  • Prefix caching is doing the heavy lifting: the repeated 12K-token agent prefix becomes effectively free on warm requests, which matters far more than marginal tokens/s gains.
  • The vMLX failure mode is a deployment trap: `--prefill-step-size` works in `SimpleEngine`, but Qwen3.5 auto-routes to the MLLM batched path, where it apparently stops helping and can OOM on real long-context workloads.
  • The SSM warning is legitimate: hybrid attention + recurrent state is not a standard KV-cache problem, so MLX stacks that assume trimmable cache semantics can misbehave on Qwen3.5-like models.
  • For a single-user local assistant, dropping continuous batching and paged KV is a rational tradeoff if it buys stability plus working prefix reuse.
  • The measurement is useful because it separates warm-cache latency from cold-cache latency instead of hand-waving about model size or quantization.
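The prefix-caching win in the first bullet amounts to block-level prefix matching: hash the prompt in fixed blocks, reuse cached state for the longest matching chain, and prefill only the tail. Everything below (`PrefixCache`, the 256-token block size) is an illustrative sketch, not the actual MLX code:

```python
import hashlib

BLOCK = 256  # tokens per cache block (assumed granularity)

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # hash of a token prefix -> opaque cached state

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def store(self, tokens):
        # Cache state at every block boundary of the prompt.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.blocks.setdefault(self._key(tokens[:end]), object())

    def longest_prefix(self, tokens):
        # Return how many leading tokens already have cached state.
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self.blocks:
                hit = end
            else:
                break
        return hit

cache = PrefixCache()
agent_prefix = list(range(12_000))   # stand-in for the 12K-token agent prefix
cache.store(agent_prefix)

# Warm request: same prefix plus a new 300-token turn.
request = agent_prefix + list(range(12_000, 12_300))
cached = cache.longest_prefix(request)
to_prefill = len(request) - cached   # only the uncached tail is prefilled
```

With these sizes, nearly all of the 12K prefix is skipped and only a few hundred tokens hit the prefill path, which is exactly the shape of a 22s-to-2s warm TTFT drop.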
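The SSM caveat in the third bullet can be seen with a toy recurrence: a KV cache is per-token, so rolling back is a slice, while a recurrent state is one value folding in the whole history, so rollback means recompute. This is illustrative code, not MLX internals:

```python
# Toy contrast between KV-cache and SSM-state rollback semantics.
x = [((i * 37) % 11) / 10 for i in range(100)]  # deterministic token stream
A, B = 0.9, 0.5                                 # toy recurrence parameters

# KV-style cache: one entry per token, so trimming to token 80 is a slice.
kv = list(x)
kv_80 = kv[:80]

# SSM-style cache: a single running state h_t = A*h_{t-1} + B*x_t.
states = []
h = 0.0
for xt in x:
    h = A * h + B * xt
    states.append(h)

# No slice of the final h recovers the state at token 80; you must
# recompute the recurrence (or keep checkpoints). That semantics
# mismatch is what breaks trim/rollback logic written for plain KV.
h_80 = 0.0
for xt in x[:80]:
    h_80 = A * h_80 + B * xt

assert len(kv_80) == 80
assert abs(h_80 - states[79]) < 1e-12
```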

// TAGS

qwen3.5 · llm · inference · prefix-caching · self-hosted · multimodal · benchmark

DISCOVERED

6h ago

2026-04-18

PUBLISHED

8h ago

2026-04-18

RELEVANCE

8 / 10

AUTHOR

trevorbg