Qwen3.6-27B NVFP4 quants miss q_scale
vLLM is warning that these Qwen3.6-27B NVFP4 checkpoints do not include an explicit q_scale (query scaling factor), so it falls back to k_scale. That is usually a checkpoint/export issue rather than a broken local setup, but it matters if you are using FP8 attention backends like flash-attn or flashinfer.
Likely missing checkpoint metadata, not a bad serving setup. vLLM is degrading gracefully here, which means the model can still run, but the quant export path probably did not preserve the full FP8 attention scale set.
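For a concrete sense of what that fallback amounts to, here is a toy sketch (not vLLM's actual code) of reusing k_scale for the query path when the checkpoint does not ship a q_scale:

```python
# Toy illustration of the fallback described above: if the checkpoint does not
# ship q_scale, reuse k_scale so FP8 attention still has a dequantization
# factor for the query tensor. This is a sketch, not vLLM's implementation.
from typing import Optional


def resolve_q_scale(q_scale: Optional[float], k_scale: float) -> float:
    """Return the query scale, falling back to the key scale when absent."""
    return q_scale if q_scale is not None else k_scale


# Checkpoint ships k_scale/v_scale but no q_scale -> q_scale mirrors k_scale.
print(resolve_q_scale(None, 0.021))   # 0.021
print(resolve_q_scale(0.018, 0.021))  # 0.018 when the export does provide one
```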
- The warning only matters for FP8 attention backends; on other attention paths it is mostly informational
- vLLM copies k_scale into q_scale as a fallback, which keeps inference moving but may not be the cleanest setup for accuracy
- This points at the quantization/export recipe for these NVFP4 repos, not at VRAM limits or prompt length
- If you care about FP8 correctness, compare against a checkpoint that explicitly ships q/k/v scales or regenerate the quant with a better exporter version (a quick way to check what a repo actually ships is sketched below)
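If you want to verify what a given export actually contains, a minimal sketch along these lines scans a local checkpoint's safetensors shards for attention-scale tensors. The key suffixes (q_scale, k_scale, v_scale) and the local path are assumptions about this particular export's naming; adjust them to whatever the repo actually uses:

```python
# Hypothetical helper: list which attention-scale tensors a local checkpoint
# actually ships. Key suffixes and the path below are assumptions, not a
# guarantee about how these NVFP4 exports name things.
from pathlib import Path
from safetensors import safe_open


def list_attention_scales(checkpoint_dir: str) -> dict:
    found = {"q_scale": [], "k_scale": [], "v_scale": []}
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        # safe_open only reads the header here; tensor data is not loaded
        with safe_open(str(shard), framework="pt") as f:
            for key in f.keys():
                for suffix in found:
                    if key.endswith(suffix):
                        found[suffix].append(key)
    return found


if __name__ == "__main__":
    scales = list_attention_scales("./Qwen3.6-27B-NVFP4")  # hypothetical local path
    for name, keys in scales.items():
        print(f"{name}: {len(keys)} tensors found")
```

If q_scale comes back empty while k_scale and v_scale are present, the warning is coming from the checkpoint itself and not from your serving configuration.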
DISCOVERED 2026-05-09
PUBLISHED 2026-05-09
AUTHOR ziphnor