Qwen3.5 prefix caching slashes TTFT from 22s to 2s
A Mac Studio user running Qwen3.5-397B-A17B locally cut warm-request time to first token (TTFT) from 22 seconds to 2 seconds by using prefix caching and chunked prefill. The post also documents a vMLX MLLM crash on long multimodal prefills and the hybrid SSM cache limits that make MLX serving stacks fragile for some models.
The big story here is not just faster inference but that serving-layer details now decide whether a 397B local model feels usable or broken.
- Prefix caching is doing the heavy lift: the repeated 12K-token agent prefix becomes effectively free on warm requests, which matters far more than marginal token/s gains.
- The vMLX failure mode is a deployment trap: `--prefill-step-size` works in `SimpleEngine`, but Qwen3.5 auto-routes to the MLLM batched path, where it apparently stops helping and can OOM on real long-context workloads.
- The SSM warning is legitimate: hybrid attention + recurrent state is not a standard KV-cache problem, so MLX stacks that assume trimmable cache semantics can misbehave on Qwen3.5-like models.
- For a single-user local assistant, dropping continuous batching and paged KV is a rational tradeoff if it buys stability plus working prefix reuse.
- The measurement is useful because it separates warm-cache latency from cold-cache latency instead of hand-waving about model size or quantization.
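To make the first point concrete, here is a minimal sketch of how prefix reuse and chunked prefill interact. It is a toy model, not the code of vMLX or any other MLX serving stack: `ToyPrefixCache`, `plan_prefill`, and `chunk_spans` are hypothetical names, the 2048-token step size is an assumed value, and the cache is assumed to be reusable at any token boundary (which, per the SSM point above, may not hold for hybrid models).

```python
from dataclasses import dataclass, field


def chunk_spans(start: int, end: int, step: int) -> list[tuple[int, int]]:
    """Split the token range [start, end) into chunks of at most `step` tokens."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(i, min(i + step, end)) for i in range(start, end, step)]


@dataclass
class ToyPrefixCache:
    """Hypothetical cache: remembers token sequences that were already prefilled."""
    entries: list[tuple[int, ...]] = field(default_factory=list)

    def longest_prefix(self, tokens: list[int]) -> int:
        """Length of the longest shared prefix between `tokens` and any entry."""
        best = 0
        for cached in self.entries:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def store(self, tokens: list[int]) -> None:
        self.entries.append(tuple(tokens))


def plan_prefill(cache: ToyPrefixCache, tokens: list[int], step: int):
    """Return (reused_token_count, chunk spans that still need computing)."""
    reused = cache.longest_prefix(tokens)
    # Always recompute at least the final token so there is something to sample from.
    reused = min(reused, len(tokens) - 1)
    spans = chunk_spans(reused, len(tokens), step)
    cache.store(tokens)
    return reused, spans


# A 12K-token agent prefix followed by a short, request-specific suffix.
cache = ToyPrefixCache()
agent_prefix = list(range(12_000))

cold_reused, cold_spans = plan_prefill(cache, agent_prefix + [900_001, 900_002], step=2048)
warm_reused, warm_spans = plan_prefill(cache, agent_prefix + [900_003, 900_004, 900_005], step=2048)

print(cold_reused, len(cold_spans))  # cold: nothing reused, full prefill in chunks
print(warm_reused, len(warm_spans))  # warm: whole prefix reused, only the suffix is computed
```

In this toy model the cold request prefills all 12,002 tokens in six chunks, while the warm request reuses the 12,000-token prefix and computes a single three-token chunk. Chunking bounds how much work each prefill step does; prefix reuse removes most of the work altogether, which is why the two requests should be timed separately.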
DISCOVERED: 2026-04-18
PUBLISHED: 2026-04-18
AUTHOR: trevorbg