OPEN_SOURCE
REDDIT // 1d ago · PRODUCT UPDATE
SGLang patches FP8 cache, image leak
Two SGLang PRs surfaced bugs that matter in production: an FP8 KV cache corruption issue on radix-cache prefix hits and a GPU memory leak on Qwen-VL-style image requests. Both were silent failures, which makes them especially risky for operators running FP8 and multimodal workloads.
// ANALYSIS
The real story here is not just that SGLang had bugs, but that they lived in edge paths that high-performance stacks often miss until users hit them in production.
- The FP8 issue hit the ragged+paged split in `forward_extend()`, where cached-prefix attention dropped `k_scale`/`v_scale` and quietly degraded outputs
- That makes FP8 deployments of models like Qwen, DeepSeek-V4, and Gemma 4 more brittle than their BF16 counterparts unless these paths are covered by tests
- The image-request leak is a classic multimodal cleanup bug: `release_features()` freed pixel tensors but left GPU-resident mrope position tensors behind
- Silent correctness regressions are worse than crashes because they can poison results while looking "healthy" in observability dashboards
- If you run SGLang in production, this is a reminder to stress uncommon cache, decode, and vision paths before rolling FP8 or VL traffic broadly
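The scale-drop failure mode is easy to see in miniature. The sketch below is not SGLang's code; it is a minimal, hypothetical illustration of why a dequantization path that forgets a per-tensor scale (like the `k_scale`/`v_scale` dropped on cached-prefix hits) degrades outputs without raising any error.

```python
# Hypothetical sketch, NOT SGLang's actual implementation: FP8-style
# quantization stores values relative to a scale; a code path that
# forgets to apply that scale still "works", just with wrong numbers.

def quantize_fp8(values, scale):
    """Store values divided by scale, rounded coarsely (stand-in for FP8)."""
    return [round(v / scale, 2) for v in values]

def dequantize(quantized, scale=1.0):
    """Reconstruct values; a default scale of 1.0 masks the missing argument."""
    return [q * scale for q in quantized]

k_scale = 0.05                            # per-tensor scale from calibration
keys = [0.12, -0.07, 0.33]
q = quantize_fp8(keys, k_scale)

correct = dequantize(q, scale=k_scale)    # scale applied: values recovered
buggy = dequantize(q)                     # scale dropped: ~20x off, no crash

print(correct)
print(buggy)
```

The point of the sketch: the buggy path returns plausible-looking tensors, so nothing downstream throws, and only output quality (or a prefix-cache-specific eval) reveals the corruption.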
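The leak follows an equally common pattern: a per-request cleanup hook frees one GPU buffer but not a second one allocated alongside it. This is a hypothetical sketch of that pattern; the class and method names are illustrative, not SGLang's real API, with a set standing in for allocator bookkeeping.

```python
# Hypothetical sketch of the multimodal cleanup-bug pattern: the release
# hook frees pixel tensors but forgets the mrope position tensors, so
# GPU memory grows with every image request. Names are illustrative.

gpu_buffers = set()  # stand-in for GPU allocator bookkeeping

def gpu_alloc(name):
    gpu_buffers.add(name)
    return name

class ImageRequest:
    def __init__(self, rid):
        self.pixel_values = gpu_alloc(f"pixels-{rid}")
        self.mrope_positions = gpu_alloc(f"mrope-{rid}")

    def release_features_buggy(self):
        gpu_buffers.discard(self.pixel_values)     # mrope tensor leaks

    def release_features_fixed(self):
        gpu_buffers.discard(self.pixel_values)
        gpu_buffers.discard(self.mrope_positions)  # free every per-request buffer

for rid in range(3):
    ImageRequest(rid).release_features_buggy()
leaked = len(gpu_buffers)   # one mrope buffer left behind per request

gpu_buffers.clear()
for rid in range(3):
    ImageRequest(rid).release_features_fixed()
clean = len(gpu_buffers)    # everything freed

print(leaked, clean)
```

Because each leaked buffer is small relative to total VRAM, the failure only surfaces after sustained image traffic, which is why stress-testing the vision path specifically matters.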
// TAGS
sglang · inference · gpu · quantization · multimodal · open-source · debugging · infrastructure
DISCOVERED
2026-05-01
PUBLISHED
2026-05-01
RELEVANCE
8/10
AUTHOR
sacrelege