OPEN_SOURCE · REDDIT · 1d ago · PRODUCT UPDATE

SGLang patches FP8 cache, image leak

Two SGLang PRs surfaced bugs that matter in production: an FP8 KV cache corruption issue on radix-cache prefix hits and a GPU memory leak on Qwen-VL-style image requests. Both were silent failures, which makes them especially risky for operators running FP8 and multimodal workloads.

// ANALYSIS

The real story here is not just that SGLang had bugs, but that both lived in edge paths that high-performance inference stacks tend to leave untested until users hit them in production.

  • The FP8 issue hit the ragged+paged split in `forward_extend()`, where cached-prefix attention dropped `k_scale`/`v_scale` and quietly degraded outputs (see the first sketch after this list)
  • That makes FP8 deployments of models like Qwen, DeepSeek-V4, and Gemma 4 more brittle than their BF16 counterparts unless these paths are covered by tests
  • The image-request leak is a classic multimodal cleanup bug: `release_features()` freed pixel tensors but left GPU-resident mrope position tensors behind (second sketch below)
  • Silent correctness regressions are worse than crashes because they can poison results while looking “healthy” in observability dashboards
  • If you run SGLang in production, this is a reminder to stress uncommon cache, decode, and vision paths before rolling FP8 or VL traffic broadly (the last sketch below shows one minimal cache-hit consistency check)
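
To make the FP8 failure mode concrete, here is a minimal, self-contained sketch of the scale-dropping pattern. This is not SGLang's actual kernel code; the function and tensor names are invented for illustration. FP8 KV entries are stored alongside per-tensor dequantization scales, and attention over a cached prefix has to apply those scales, so a branch that forgets them produces plausible-looking but wrong outputs rather than an error.

```python
import torch

def attend_over_cached_prefix(q, k_fp8, v_fp8, k_scale, v_scale, *, drop_scales=False):
    """Toy single-head attention over an FP8-quantized KV prefix.

    k_fp8 / v_fp8 are stored as float8_e4m3fn with per-tensor scales.
    drop_scales=True mimics the bug: the cached-prefix branch reads the FP8
    values but never applies the dequantization scales.
    """
    k = k_fp8.to(torch.float32)
    v = v_fp8.to(torch.float32)
    if not drop_scales:
        k = k * k_scale
        v = v * v_scale
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
dim, prefix_len = 64, 32
q = torch.randn(1, dim)
k = torch.randn(prefix_len, dim) * 3.0      # values that need a non-trivial scale
v = torch.randn(prefix_len, dim) * 3.0
k_scale = k.abs().max() / 448.0             # 448 = max representable float8_e4m3fn
v_scale = v.abs().max() / 448.0
k_fp8 = (k / k_scale).to(torch.float8_e4m3fn)
v_fp8 = (v / v_scale).to(torch.float8_e4m3fn)

correct = attend_over_cached_prefix(q, k_fp8, v_fp8, k_scale, v_scale)
buggy = attend_over_cached_prefix(q, k_fp8, v_fp8, k_scale, v_scale, drop_scales=True)
# No exception on either path; the only symptom is numerically wrong attention output.
print("max abs deviation from correct path:", (correct - buggy).abs().max().item())
```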
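
The multimodal leak follows an equally common pattern: per-request cleanup frees the obvious large tensor but not every GPU-resident buffer tied to the request. A hypothetical sketch of that pattern (the class and method names below are illustrative, not SGLang's):

```python
import torch

class ImageRequestState:
    """Hypothetical per-request state for a Qwen-VL-style image request."""

    def __init__(self, pixel_values: torch.Tensor, mrope_positions: torch.Tensor):
        self.pixel_values = pixel_values        # large vision-encoder input
        self.mrope_positions = mrope_positions  # multimodal rotary position ids

    def release_features_buggy(self):
        # Leak pattern: only the pixel tensor is dropped. mrope_positions stays
        # referenced by the long-lived request object, so its GPU memory is
        # stranded for as long as the server keeps the request around.
        self.pixel_values = None

    def release_features_fixed(self):
        # Fix: drop every GPU-resident tensor tied to the finished request.
        self.pixel_values = None
        self.mrope_positions = None

if torch.cuda.is_available():
    request_table = []
    for _ in range(64):
        state = ImageRequestState(
            pixel_values=torch.empty(3, 1024, 1024, device="cuda"),
            mrope_positions=torch.empty(3, 32_768, dtype=torch.long, device="cuda"),
        )
        state.release_features_buggy()
        request_table.append(state)             # request bookkeeping keeps objects alive
    print(f"GPU memory still allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
```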
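
Finally, the kind of smoke test that catches a silent cached-prefix divergence is cheap to run: issue the same greedy request twice against a running server so the second call is served from the radix cache, then check that the outputs match. The sketch below assumes an OpenAI-compatible SGLang endpoint on localhost:30000 and a placeholder model name.

```python
import requests

BASE_URL = "http://localhost:30000/v1/completions"  # assumed local SGLang server, default port
MODEL = "qwen-fp8"                                   # placeholder model name

def greedy_completion(prompt: str) -> str:
    resp = requests.post(BASE_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 32,
        "temperature": 0.0,   # greedy, so repeated runs should be identical
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# A long repeated prompt makes the second call reuse a long cached prefix.
prompt = "Summarize the architecture of a transformer decoder in three sentences. " * 8

cold = greedy_completion(prompt)   # prefix computed from scratch
warm = greedy_completion(prompt)   # prefix served from the radix cache

# A mismatch here is exactly the kind of silent divergence the FP8 prefix bug produced.
print("cache-hit output matches cold output:", cold == warm)
```
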
// TAGS
sglang · inference · gpu · quantization · multimodal · open-source · debugging · infrastructure

DISCOVERED

2026-05-01 (1d ago)

PUBLISHED

2026-05-01 (1d ago)

RELEVANCE

8/10

AUTHOR

sacrelege