Qwen3.5 prefix caching slashes TTFT 22s to 2s

// 90d ago · BENCHMARK RESULT

A Mac Studio user running Qwen3.5-397B-A17B locally cut warm-request time to first token (TTFT) from 22 seconds to 2 seconds by using prefix caching and chunked prefill. The post also documents a vMLX MLLM crash on long multimodal prefills and the hybrid SSM cache limits that make MLX serving stacks fragile for some models.
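To make the two techniques concrete, here is a minimal Python sketch of how prefix reuse and chunked prefill interact. It is a toy planner over token IDs, not vMLX or MLX code: `ToyPrefixCache`, `plan_prefill`, and the 2,048-token chunk size are hypothetical names and values chosen for illustration, and the 12,000-element list merely stands in for the repeated agent prefix.

```python
from dataclasses import dataclass, field


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading token IDs shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToyPrefixCache:
    """Remembers the token IDs of the last prompt whose state was kept (hypothetical)."""
    cached_tokens: list[int] = field(default_factory=list)

    def plan_prefill(self, prompt: list[int], chunk_size: int) -> list[tuple[int, int]]:
        """Return the (start, end) chunks that still need a forward pass."""
        reused = common_prefix_len(self.cached_tokens, prompt)
        # Recompute at least the final token so there are logits to sample from.
        reused = max(0, min(reused, len(prompt) - 1))
        # Chunked prefill: process the uncached tail in bounded slices.
        chunks = [(start, min(start + chunk_size, len(prompt)))
                  for start in range(reused, len(prompt), chunk_size)]
        self.cached_tokens = list(prompt)
        return chunks


cache = ToyPrefixCache()
agent_prefix = list(range(12_000))  # stand-in for the repeated agent prefix
cold = cache.plan_prefill(agent_prefix + [90_001, 90_002], chunk_size=2_048)
warm = cache.plan_prefill(agent_prefix + [90_003, 90_004, 90_005], chunk_size=2_048)

cold_tokens = sum(end - start for start, end in cold)
warm_tokens = sum(end - start for start, end in warm)
print(cold_tokens, warm_tokens)  # 12002 3
```

In this toy, the cold request prefills every token in six bounded chunks, while the warm request only prefills the three tokens that differ. A real server reuses the stored model state for the matched prefix instead of a token list, but the accounting is the same idea.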

// ANALYSIS

The big story here is not just faster inference; it is that serving-layer details now decide whether a 397B local model feels usable or broken.

– Prefix caching is doing the heavy lifting: the repeated 12K-token agent prefix becomes effectively free on warm requests, which matters far more than marginal tokens/s gains.
– The vMLX failure mode is a deployment trap: `--prefill-step-size` works in `SimpleEngine`, but Qwen3.5 auto-routes to the MLLM batched path, where the flag apparently stops helping and real long-context workloads can run out of memory.
– The SSM warning is legitimate: hybrid attention plus recurrent state is not a standard KV-cache problem, so MLX stacks that assume trimmable cache semantics can misbehave on Qwen3.5-like models.
– For a single-user local assistant, dropping continuous batching and paged KV is a rational tradeoff if it buys stability and working prefix reuse.
– The measurement is useful because it separates warm-cache latency from cold-cache latency instead of hand-waving about model size or quantization.
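The point about trimmable cache semantics can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the MLX or vMLX cache implementation: `ToyKVCache` and `ToyRecurrentState` are hypothetical classes, and the integer update rule only imitates a lossy recurrent state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Attention-style cache: one entry per token position (hypothetical)."""
    entries: list[int] = field(default_factory=list)

    def append(self, token: int) -> None:
        self.entries.append(token)

    def trim_to(self, length: int) -> None:
        # Dropping the tail leaves exactly the state of the shorter prefix.
        del self.entries[length:]


@dataclass
class ToyRecurrentState:
    """SSM-style cache: one fixed-size summary that folds in every token (hypothetical)."""
    value: int = 0
    length: int = 0

    def append(self, token: int) -> None:
        # Lossy update: the previous value cannot be recovered from the new one.
        self.value = (self.value * 31 + token) % 1_000_003
        self.length += 1

    def trim_to(self, length: int) -> None:
        if length != self.length:
            raise ValueError("recurrent state cannot be rewound; restore a snapshot or recompute")


kv, ssm = ToyKVCache(), ToyRecurrentState()
snapshots: dict[int, ToyRecurrentState] = {}
for tok in [5, 6, 7, 8]:
    kv.append(tok)
    ssm.append(tok)
    # Keep a copy of the recurrent state at each prefix length.
    snapshots[ssm.length] = ToyRecurrentState(ssm.value, ssm.length)

kv.trim_to(2)  # fine: the KV cache now matches the 2-token prefix
try:
    ssm.trim_to(2)
    rewound = True
except ValueError:
    rewound = False
    ssm = snapshots[2]  # the only way back is a saved snapshot
```

The per-position cache can be cut back to any shared prefix, while the recurrent summary can only be reused at lengths where a snapshot was saved. A serving stack that treats every cache as trimmable would silently hand back the wrong state in the second case, which is the kind of misbehavior the warning describes.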

// TAGS

qwen3.5 · llm · inference · prefix-caching · self-hosted · multimodal · benchmark

DISCOVERED

90d ago

2026-04-18

PUBLISHED

90d ago

2026-04-18

RELEVANCE

8/10

AUTHOR

trevorbg
