SGLang, ExLlamaV2 hit sub-150ms TTFT for Qwen3.5-9B
OPEN_SOURCE
REDDIT · 20d ago · BENCHMARK RESULT


Benchmarks for real-time voice chat pipelines identify SGLang and ExLlamaV2 as the performance leaders for Qwen 3.5 9B. On RTX 3090 Ti hardware, these engines achieve the sub-150ms Time To First Token (TTFT) required for seamless human-AI conversation.
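TTFT is simple to measure yourself against any streaming backend. A minimal sketch of the measurement, written against a generic token iterator so it works with the chunk stream of an OpenAI-compatible endpoint (which SGLang exposes) or any other streamed decode loop; the iterator plumbing here is an illustration, not the benchmark authors' harness:

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for an iterable of streamed tokens.

    `stream` can be any iterator yielding decoded text pieces, e.g. the
    chunk iterator from a streaming /v1/chat/completions call against a
    local SGLang server (that wiring is assumed, not shown here).
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # Latency from request start to the first generated token.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens
```

For a voice pipeline, start the clock when the final transcript is handed to the LLM, so TTFT reflects what the user actually waits through.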

// ANALYSIS

Qwen 3.5 9B is a dense model that demands high memory bandwidth, making the selection of an inference backend a make-or-break decision for "Time to Sentence" latency.
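The bandwidth pressure can be made concrete with back-of-envelope arithmetic: single-stream decode must stream every weight per token, so memory bandwidth divided by model size gives a hard TPS ceiling. A sketch with illustrative numbers (the RTX 3090 Ti's ~1008 GB/s is its spec-sheet bandwidth; the 4.5 bpw figure is an example quant, not a measured result):

```python
def bandwidth_bound_tps(mem_bandwidth_gbs, n_params_billion, bits_per_weight):
    """Rough upper bound on single-stream decode speed: each generated
    token reads all weights once, so TPS <= bandwidth / model size."""
    model_gb = n_params_billion * bits_per_weight / 8.0
    return mem_bandwidth_gbs / model_gb

# 9B params at 4.5 bpw is ~5.06 GB of weights; at ~1008 GB/s the
# ceiling is roughly 199 TPS before kernel overheads eat into it.
ceiling = bandwidth_bound_tps(1008, 9, 4.5)
```

This is why quantization shows up directly in decode speed: halving the bytes per weight roughly doubles the bandwidth-bound ceiling.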

  • SGLang is currently the "gold standard" for lowest latency due to aggressive kernel fusion and its dedicated low-latency mode which pre-allocates KV cache.
  • Multi-Token Prediction (MTP) with a 5-token lookahead significantly boosts decoding speeds on Ampere-class GPUs, nearly doubling raw tokens per second (TPS).
  • While speculative decoding with a draft model (like Qwen 0.6B) can increase throughput, the initial overhead often negates TTFT gains in single-stream real-time use cases.
  • Transitioning from FP16 to optimized FP8 or EXL2 quants (4.0-5.0 bpw) is mandatory to hit the 500-700ms total response time target for conversational interaction.
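The 500-700ms target above decomposes into TTFT plus decode time for the reply. A small sketch of that budget arithmetic (the 140ms/60 TPS/30-token figures are illustrative assumptions, not the benchmark's measurements):

```python
def total_response_ms(ttft_ms, n_tokens, tps):
    """Estimate end-to-end response latency: time to first token plus
    decode time for the remaining tokens at a steady TPS rate."""
    decode_ms = (n_tokens - 1) / tps * 1000.0
    return ttft_ms + decode_ms

# A short 30-token voice reply at 140ms TTFT and 60 TPS lands around
# 623ms -- inside the 500-700ms conversational window, which is why
# sub-150ms TTFT matters: it leaves the whole remaining budget for decode.
budget = total_response_ms(140, 30, 60)
```

The same arithmetic shows why speculative decoding's extra prefill overhead can hurt here: in single-stream use, anything added to `ttft_ms` comes straight off the decode budget.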
// TAGS
qwen3-5-9b · llm · inference · sglang · exllamav2 · vllm · benchmark · open-weights

DISCOVERED

20d ago

2026-03-23

PUBLISHED

20d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

Nasa1423