SGLang patches FP8 cache, image leak
Two SGLang PRs surfaced bugs that matter in production: an FP8 KV cache corruption issue on radix-cache prefix hits and a GPU memory leak on Qwen-VL-style image requests. Both were silent failures, which makes them especially risky for operators running FP8 and multimodal workloads.
The notable part is not that SGLang had bugs, but where they lived: in edge paths that high-performance serving stacks often leave untested until users hit them in production.
- The FP8 issue hit the ragged+paged split in `forward_extend()`, where cached-prefix attention dropped `k_scale`/`v_scale` and quietly degraded outputs
- That makes FP8 deployments of models like Qwen, DeepSeek-V4, and Gemma 4 more brittle than their BF16 counterparts unless these paths are covered by tests
- The image-request leak is a classic multimodal cleanup bug: `release_features()` freed pixel tensors but left GPU-resident mrope position tensors behind
- Silent correctness regressions are worse than crashes because they can poison results while looking “healthy” in observability dashboards
- If you run SGLang in production, this is a reminder to stress uncommon cache, decode, and vision paths before rolling FP8 or VL traffic broadly
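To see why a dropped `k_scale`/`v_scale` degrades output without crashing, here is a minimal NumPy sketch. It is not SGLang's code: the `quantize` and `attention` helpers are hypothetical, and FP8 is approximated by scaling into the e4m3 range and rounding, which is an assumption made only to keep the example self-contained.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite magnitude in the e4m3 format


def quantize(x):
    """Hypothetical per-tensor quantizer: scale into the FP8 range, then round.

    Rounding to integers is a stand-in for real FP8 casting.
    """
    scale = float(np.abs(x).max()) / FP8_MAX
    q = np.clip(np.round(x / scale), -FP8_MAX, FP8_MAX)
    return q, scale


def attention(q, k, v):
    """Plain softmax attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
k = rng.standard_normal((16, 64))
v = rng.standard_normal((16, 64))

k_q, k_scale = quantize(k)
v_q, v_scale = quantize(v)

reference = attention(q, k, v)
# Correct path: dequantize cached K/V with their scales before attention.
with_scales = attention(q, k_q * k_scale, v_q * v_scale)
# Buggy path: the scales are dropped, so raw quantized values are used.
without_scales = attention(q, k_q, v_q)

err_with = float(np.abs(with_scales - reference).max())
err_without = float(np.abs(without_scales - reference).max())
print(f"max error with scales:    {err_with:.4f}")
print(f"max error without scales: {err_without:.4f}")
```

The buggy path still returns finite, well-shaped tensors, so nothing raises and nothing looks wrong in a dashboard. A regression test that compares a cached-prefix run against a reference run is the kind of check that would catch it.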
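The image-request leak follows a common pattern: a cleanup method names the fields it frees, and a field added later is never included. The sketch below is a hypothetical illustration, not SGLang's implementation. `FakeTensor`, `ImageRequestState`, and `held_bytes` are invented for the example, and only the names `release_features` and the mrope position tensors come from the report above.

```python
from dataclasses import dataclass, fields
from typing import Optional


class FakeTensor:
    """Stand-in for a GPU tensor; real code would hold torch.Tensor objects."""

    def __init__(self, nbytes: int):
        self.nbytes = nbytes


@dataclass
class ImageRequestState:
    """Hypothetical per-request state for a vision-language request."""

    pixel_values: Optional[FakeTensor] = None
    mrope_positions: Optional[FakeTensor] = None

    def release_features_leaky(self) -> None:
        # Frees only the field the author remembered; mrope_positions survives.
        self.pixel_values = None

    def release_features(self) -> None:
        # Drops every tensor-holding field, so newly added fields are covered.
        for f in fields(self):
            setattr(self, f.name, None)


def held_bytes(state: ImageRequestState) -> int:
    """Sum the bytes still referenced by the request state."""
    values = (getattr(state, f.name) for f in fields(state))
    return sum(t.nbytes for t in values if t is not None)


leaky = ImageRequestState(FakeTensor(1024), FakeTensor(256))
leaky.release_features_leaky()
leaked = held_bytes(leaky)

fixed = ImageRequestState(FakeTensor(1024), FakeTensor(256))
fixed.release_features()
freed = held_bytes(fixed)

print(f"bytes still held after leaky release: {leaked}")
print(f"bytes still held after full release:  {freed}")
```

Iterating over all fields is one possible design, and it assumes every field holds a releasable tensor. An explicit list of fields plus a test that asserts nothing is held after release would work as well.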
DISCOVERED: 2026-05-01
PUBLISHED: 2026-05-01
AUTHOR: sacrelege
