OPEN_SOURCE
REDDIT // BENCHMARK RESULT
SGLang, ExLlamaV2 hit sub-150ms TTFT for Qwen3.5-9B
Benchmarks for real-time voice chat pipelines identify SGLang and ExLlamaV2 as the performance leaders for Qwen 3.5 9B. On RTX 3090 Ti hardware, these engines achieve the sub-150ms Time To First Token (TTFT) required for seamless human-AI conversation.
// ANALYSIS
Qwen 3.5 9B is a dense model that demands high memory bandwidth, making the selection of an inference backend a make-or-break decision for "Time to Sentence" latency.
- SGLang is currently the "gold standard" for lowest latency due to aggressive kernel fusion and its dedicated low-latency mode, which pre-allocates KV cache.
- Multi-Token Prediction (MTP) with a 5-token lookahead significantly boosts decoding speeds on Ampere-class GPUs, nearly doubling raw tokens per second (TPS).
- While speculative decoding with a draft model (like Qwen 0.6B) can increase throughput, the initial overhead often negates TTFT gains in single-stream real-time use cases.
- Transitioning from FP16 to optimized FP8 or EXL2 quants (4.0-5.0 bpw) is mandatory to hit the 500-700ms total response time target for conversational interaction.
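Measuring TTFT correctly matters for comparisons like the one above: it is the wall-clock delay between sending the request and receiving the first streamed token, independent of total generation time. A minimal sketch of such a measurement harness is below; the `fake_stream` backend is a stand-in assumption, but the same `measure_ttft` helper works with any token iterator, e.g. the chunk stream from an OpenAI-compatible client pointed at a local SGLang or ExLlamaV2 server (client and endpoint details are not specified in the benchmark post).

```python
import time

def measure_ttft(token_stream):
    """Consume a token iterator; return (ttft_seconds, tokens).

    TTFT is clocked from the moment iteration starts until the
    first token arrives, matching the benchmark's definition.
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token latency
        tokens.append(tok)
    return ttft, tokens

def fake_stream(delay_s=0.01, n=5):
    """Simulated backend: sleeps before each token to mimic decode latency."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"

if __name__ == "__main__":
    ttft, toks = measure_ttft(fake_stream())
    print(f"TTFT: {ttft * 1000:.1f} ms over {len(toks)} tokens")
```

For a real run, replace `fake_stream()` with the streaming response iterator from the serving engine; averaging TTFT over many single-stream requests is what yields figures comparable to the sub-150ms numbers reported here.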
// TAGS
qwen3-5-9b · llm · inference · sglang · exllamav2 · vllm · benchmark · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
Nasa1423