OPEN_SOURCE
REDDIT // BENCHMARK RESULT
SGLang, ExLlamaV2 hit sub-150ms TTFT for Qwen3.5-9B
Benchmarks for real-time voice chat pipelines identify SGLang and ExLlamaV2 as the performance leaders for Qwen 3.5 9B. On RTX 3090 Ti hardware, these engines achieve the sub-150ms Time To First Token (TTFT) required for seamless human-AI conversation.
// ANALYSIS
Qwen 3.5 9B is a dense model that demands high memory bandwidth, making the selection of an inference backend a make-or-break decision for "Time to Sentence" latency.
- SGLang is currently the "gold standard" for lowest latency due to aggressive kernel fusion and its dedicated low-latency mode, which pre-allocates KV cache.
- Multi-Token Prediction (MTP) with a 5-token lookahead significantly boosts decoding speeds on Ampere-class GPUs, nearly doubling raw tokens per second (TPS).
- While speculative decoding with a draft model (like Qwen 0.6B) can increase throughput, the initial overhead often negates TTFT gains in single-stream real-time use cases.
- Transitioning from FP16 to optimized FP8 or EXL2 quants (4.0-5.0 bpw) is mandatory to hit the 500-700ms total response time target for conversational interaction.
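Measuring TTFT correctly matters for comparisons like the one above: it is the wall-clock delay between sending the request and receiving the first streamed token, independent of total generation time. A minimal sketch of such a measurement harness is below; the `fake_stream` backend is a stand-in assumption, but the same `measure_ttft` helper works with any token iterator, e.g. the chunk stream from an OpenAI-compatible client pointed at a local SGLang or ExLlamaV2 server (client and endpoint details are not specified in the benchmark post).

```python
import time

def measure_ttft(token_stream):
    """Consume a token iterator; return (ttft_seconds, tokens).

    TTFT is clocked from the moment iteration starts until the
    first token arrives, matching the benchmark's definition.
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token latency
        tokens.append(tok)
    return ttft, tokens

def fake_stream(delay_s=0.01, n=5):
    """Simulated backend: sleeps before each token to mimic decode latency."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"

if __name__ == "__main__":
    ttft, toks = measure_ttft(fake_stream())
    print(f"TTFT: {ttft * 1000:.1f} ms over {len(toks)} tokens")
```

For a real run, replace `fake_stream()` with the streaming response iterator from the serving engine; averaging TTFT over many single-stream requests is what yields figures comparable to the sub-150ms numbers reported here.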
// TAGS
qwen3-5-9b · llm · inference · sglang · exllamav2 · vllm · benchmark · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
Nasa1423