Qwen3.6-27B NVFP4 quants miss q_scale
vLLM is warning that these Qwen3.6-27B NVFP4 checkpoints do not include an explicit q_scale (query scaling factor), so it falls back to k_scale. That is usually a checkpoint/export issue rather than a broken local setup, but it matters if you are using FP8 attention backends like flash-attn or flashinfer.
Likely missing checkpoint metadata, not a bad serving setup. vLLM is degrading gracefully here, which means the model can still run, but the quant export path probably did not preserve the full FP8 attention scale set.
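For a concrete sense of what that fallback amounts to, here is a toy sketch (not vLLM's actual code) of reusing k_scale for the query path when the checkpoint does not ship a q_scale:

```python
# Toy illustration of the fallback described above: if the checkpoint does not
# ship q_scale, reuse k_scale so FP8 attention still has a dequantization
# factor for the query tensor. This is a sketch, not vLLM's implementation.
from typing import Optional


def resolve_q_scale(q_scale: Optional[float], k_scale: float) -> float:
    """Return the query scale, falling back to the key scale when absent."""
    return q_scale if q_scale is not None else k_scale


# Checkpoint ships k_scale/v_scale but no q_scale -> q_scale mirrors k_scale.
print(resolve_q_scale(None, 0.021))   # 0.021
print(resolve_q_scale(0.018, 0.021))  # 0.018 when the export does provide one
```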
- The warning only matters for FP8 attention backends; on other attention paths it is mostly informational
- vLLM copies k_scale into q_scale as a fallback, which keeps inference moving but may not be the cleanest setup for accuracy
- This points at the quantization/export recipe for these NVFP4 repos, not at VRAM limits or prompt length
- If you care about FP8 correctness, compare against a checkpoint that explicitly ships q/k/v scales or regenerate the quant with a better exporter version (a quick way to check what a repo actually ships is sketched below)
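If you want to verify what a given export actually contains, a minimal sketch along these lines scans a local checkpoint's safetensors shards for attention-scale tensors. The key suffixes (q_scale, k_scale, v_scale) and the local path are assumptions about this particular export's naming; adjust them to whatever the repo actually uses:

```python
# Hypothetical helper: list which attention-scale tensors a local checkpoint
# actually ships. Key suffixes and the path below are assumptions, not a
# guarantee about how these NVFP4 exports name things.
from pathlib import Path
from safetensors import safe_open


def list_attention_scales(checkpoint_dir: str) -> dict:
    found = {"q_scale": [], "k_scale": [], "v_scale": []}
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        # safe_open only reads the header here; tensor data is not loaded
        with safe_open(str(shard), framework="pt") as f:
            for key in f.keys():
                for suffix in found:
                    if key.endswith(suffix):
                        found[suffix].append(key)
    return found


if __name__ == "__main__":
    scales = list_attention_scales("./Qwen3.6-27B-NVFP4")  # hypothetical local path
    for name, keys in scales.items():
        print(f"{name}: {len(keys)} tensors found")
```

If q_scale comes back empty while k_scale and v_scale are present, the warning is coming from the checkpoint itself and not from your serving configuration.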
DISCOVERED 2026-05-09
PUBLISHED 2026-05-09
AUTHOR ziphnor