SGLang patches FP8 cache, image leak

// 46d ago · PRODUCT UPDATE

Two SGLang PRs fix bugs that matter in production: FP8 KV cache corruption on radix-cache prefix hits, and a GPU memory leak on Qwen-VL-style image requests. Both failed silently, which makes them especially risky for operators running FP8 or multimodal workloads.
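
A dropped scale corrupts output silently instead of crashing. The toy single-query attention over an FP8-quantized cache below shows why. It is a minimal sketch, not SGLang code. `round_e4m3`, `quantize_rows`, and `attend` are hypothetical helpers. E4M3 rounding is approximated (3 mantissa bits, saturation at ±448, no subnormals), and per-tensor amax scaling is assumed. Quantization scales stored values up toward the FP8 range, so reading them back without `k_scale`/`v_scale` yields well-formed numbers that are simply wrong.

```python
import math
from typing import List, Sequence, Tuple

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_e4m3(x: float) -> float:
    """Approximate E4M3 rounding: saturate at +/-448, keep 3 mantissa bits.

    Subnormals are ignored; this is a toy, not a bit-exact FP8 cast.
    """
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)  # 1 implicit + 3 stored bits


def quantize_rows(rows: Sequence[Sequence[float]]) -> Tuple[List[List[float]], float]:
    """Per-tensor quantization: one scale maps the largest |value| to 448."""
    amax = max(abs(x) for row in rows for x in row) or 1.0
    scale = amax / FP8_E4M3_MAX
    return [[round_e4m3(x / scale) for x in row] for row in rows], scale


def attend(q, keys, values, k_scale=1.0, v_scale=1.0):
    """Single-query scaled dot-product attention over a (possibly FP8) cache."""
    d = len(q)
    scores = [sum(qi * ki * k_scale for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    out = [0.0] * len(values[0])
    for w, v in zip(exps, values):
        for i, x in enumerate(v):
            out[i] += (w / total) * x * v_scale
    return out


def max_err(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


q = [0.3, -0.1, 0.2, 0.4]
keys = [[0.5, 0.1, -0.2, 0.3], [-0.4, 0.2, 0.1, 0.0], [0.1, -0.3, 0.4, 0.2]]
vals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

ref = attend(q, keys, vals)  # full-precision reference
k8, k_scale = quantize_rows(keys)
v8, v_scale = quantize_rows(vals)
good = attend(q, k8, v8, k_scale=k_scale, v_scale=v_scale)  # scales applied
bad = attend(q, k8, v8)  # scales dropped: the failure mode described above

print(f"scales applied: max error {max_err(good, ref):.4f}")
print(f"scales dropped: max error {max_err(bad, ref):.4f}")
```

Nothing raises in the dropped-scale case. The output is just far from the reference. This toy drops the scales for the whole cache, whereas the reported bug affected only the cached-prefix portion. Because the toy uses amax scaling, it likely exaggerates the error compared with deployments whose real scales sit closer to 1.0.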

// ANALYSIS

The real story is not just that SGLang had bugs, but that they lived in edge paths that high-performance stacks often leave untested until users hit them in production.
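
One way to exercise such edge paths is a differential check. Send the same prompt once against an empty cache and once after its prefix is cached, then compare the outputs (for example, greedy-decoded token logprobs). The sketch below assumes hypothetical `generate` and `reset_cache` callables. Against a real deployment they would wrap a request and whatever cache-flush mechanism the server exposes. `ToyEngine` is a stand-in whose buggy mode drops a scale factor on cache hits.

```python
from typing import Callable, Dict, List, Sequence


def check_prefix_cache_consistency(
    generate: Callable[[str], Sequence[float]],
    reset_cache: Callable[[], None],
    prompts: Sequence[str],
    atol: float = 1e-3,
) -> List[str]:
    """Return the prompts whose cache-miss and cache-hit outputs diverge."""
    failures: List[str] = []
    for prompt in prompts:
        reset_cache()
        cold = generate(prompt)  # cache-miss path
        warm = generate(prompt)  # cache-hit path: prefix is now cached
        if len(cold) != len(warm) or any(
            abs(a - b) > atol for a, b in zip(cold, warm)
        ):
            failures.append(prompt)
    return failures


class ToyEngine:
    """Stand-in server: caches 'quantized' values per prompt and rescales on read.

    With buggy=True the cache-hit path skips the scale, mimicking a dropped
    k_scale/v_scale on the cached-prefix path.
    """

    SCALE = 0.01

    def __init__(self, buggy: bool) -> None:
        self.buggy = buggy
        self.cache: Dict[str, List[float]] = {}

    def reset_cache(self) -> None:
        self.cache.clear()

    def generate(self, prompt: str) -> List[float]:
        if prompt in self.cache:
            stored = self.cache[prompt]
            scale = 1.0 if self.buggy else self.SCALE
        else:
            stored = [float(ord(c)) for c in prompt]
            self.cache[prompt] = stored
            scale = self.SCALE
        return [x * scale for x in stored]


for buggy in (False, True):
    engine = ToyEngine(buggy=buggy)
    divergent = check_prefix_cache_consistency(
        engine.generate, engine.reset_cache, ["hello", "fp8 prefix"]
    )
    print(f"buggy={buggy}: divergent prompts = {divergent}")
```

The check needs no ground-truth labels, only agreement between two paths that should be equivalent. That makes it a cheap fit for CI or a canary stage. A BF16 run of the same prompt can serve as a second reference for FP8 paths.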

  • The FP8 issue hit the ragged+paged split in `forward_extend()`, where cached-prefix attention dropped `k_scale`/`v_scale` and quietly degraded outputs
  • That makes FP8 deployments of models like Qwen, DeepSeek-V4, and Gemma 4 more brittle than their BF16 counterparts unless these paths are covered by tests
  • The image-request leak is a classic multimodal cleanup bug: `release_features()` freed pixel tensors but left GPU-resident mrope position tensors behind
  • Silent correctness regressions are worse than crashes because they can poison results while looking “healthy” in observability dashboards
  • If you run SGLang in production, treat this as a reminder to stress-test uncommon cache, decode, and vision paths before rolling out FP8 or VL traffic broadly
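
The image-request leak follows a common cleanup pattern: a release hook frees the large per-request tensor but misses a small one. In Qwen2-VL-style models the small one holds M-RoPE (multimodal rotary position embedding) positions. The sketch below models that pattern with a fake allocator, so the leak is measurable without a GPU. `ImageRequest`, `FakeGpuAllocator`, `FakeTensor`, `mrope_positions`, and the byte sizes are illustrative assumptions, not SGLang's actual classes. `free()` stands in for dropping the last reference to a GPU tensor.

```python
from typing import Optional


class FakeGpuAllocator:
    """Tracks live 'GPU' bytes so a leak is visible without a real GPU."""

    def __init__(self) -> None:
        self.live_bytes = 0

    def alloc(self, nbytes: int) -> "FakeTensor":
        self.live_bytes += nbytes
        return FakeTensor(self, nbytes)


class FakeTensor:
    """Stand-in for a GPU tensor; free() models dropping the last reference."""

    def __init__(self, allocator: FakeGpuAllocator, nbytes: int) -> None:
        self.allocator = allocator
        self.nbytes = nbytes

    def free(self) -> None:
        self.allocator.live_bytes -= self.nbytes


class ImageRequest:
    """Hypothetical per-request multimodal state (names and sizes are illustrative)."""

    def __init__(self, alloc: FakeGpuAllocator) -> None:
        self.pixel_values: Optional[FakeTensor] = alloc.alloc(4_000_000)
        self.mrope_positions: Optional[FakeTensor] = alloc.alloc(48_000)

    def release_features(self, fixed: bool) -> None:
        # Both versions release the large pixel tensor.
        if self.pixel_values is not None:
            self.pixel_values.free()
            self.pixel_values = None
        # The buggy version stops here, leaving the position tensor resident.
        if fixed and self.mrope_positions is not None:
            self.mrope_positions.free()
            self.mrope_positions = None


def run(num_requests: int, fixed: bool) -> int:
    """Serve num_requests image requests and return the bytes still live."""
    alloc = FakeGpuAllocator()
    for _ in range(num_requests):
        req = ImageRequest(alloc)
        req.release_features(fixed=fixed)
    return alloc.live_bytes


print("buggy:", run(1000, fixed=False))  # 48000000: grows with request count
print("fixed:", run(1000, fixed=True))   # 0
```

Per request, the leaked tensor is small relative to the pixel data. Resident memory therefore climbs in proportion to image-request count rather than appearing as one large allocation.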
// TAGS
sglang, inference, gpu, quantization, multimodal, open-source, debugging, infrastructure

DISCOVERED: 46d ago (2026-05-01)
PUBLISHED: 47d ago (2026-05-01)
RELEVANCE: 8/10
AUTHOR: sacrelege