Qwen3.6-35B-A3B KLD sweep spots quant tradeoffs
OPEN_SOURCE
REDDIT · 5h ago · BENCHMARK RESULT

A GPU-side vLLM KLD benchmark compares INT and NVFP4 quantizations of Qwen3.6-35B-A3B using real logits, framing the results as a practical tradeoff between accuracy, speed, and kernel support. The post argues that raw KLD is a useful signal, but it still needs to be weighed against use-case-specific evals.

// ANALYSIS

The main point is that quantization is not a simple “lower bits wins” story. On this model, the author’s read is that FP8 can still trail INT8 on quality, while NVFP4 looks attractive on paper but does not automatically deliver better quality or speed in practice.

  • The benchmark is presented as real GPU work in vLLM, not a synthetic offline proxy, so the comparison is aimed at serving reality rather than theory.
  • KLD is used as a divergence signal, but the post correctly notes that better KLD does not guarantee better task evals for a specific workload.
  • The FP8 vs INT8 comparison reinforces the common serving tradeoff: if the kernel path is mature, a higher-precision format can still be the safer deployment choice.
  • NVFP4 is framed as especially sensitive to implementation details and activation handling, which means “4-bit” alone is not a performance guarantee.
  • For practitioners, the useful takeaway is to pick quantization by end-to-end behavior on your own GPU stack, not by headline bit-width alone.
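
The KLD signal the post relies on boils down to comparing token-level output distributions between a reference run and a quantized run. A minimal sketch of that computation, assuming you have dumped matching logit arrays from both runs (the function names and shapes here are illustrative, not from the post):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_kld(ref_logits, quant_logits, eps=1e-12):
    """Mean per-token KL(ref || quant) over a batch of token positions.

    ref_logits, quant_logits: arrays of shape (num_tokens, vocab_size)
    holding the full-precision and quantized model logits for the same
    prompts. Lower is better: the quantized output stays closer to the
    reference distribution.
    """
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    kld = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kld.mean())

# Identical logits give zero divergence; any quantization error shows up
# as a positive mean KLD.
ref = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
print(mean_token_kld(ref, ref))  # → 0.0
```

Note that this single number averages over all positions, which is exactly why the post's caveat matters: two quantizations with similar mean KLD can still diverge on the rare tokens a specific workload cares about.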
// TAGS
qwen3.6-35b-a3b · benchmark · quantization · kld · llm · inference · gpu · vllm

DISCOVERED

5h ago

2026-04-26

PUBLISHED

6h ago

2026-04-25

RELEVANCE

8/10

AUTHOR

Phaelon74