OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
Qwen3.6 quants split local inference crowd
A LocalLLaMA thread compares Qwen3.6-27B FP8, 6-bit AWQ, and AWQ BF16-INT4 builds for vLLM on dual RTX 3090s. The practical tradeoff comes down to memory, kernel support, and quality: the official FP8 release is the safer accuracy pick, while the INT4/AWQ variants trade some fidelity for fitting and throughput on consumer GPUs.
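For concreteness, a hedged sketch of what a dual-3090 vLLM launch for one of these builds might look like. The model repo name and the flag values (context length, memory fraction) are illustrative assumptions, not taken from the thread; check `vllm serve --help` for your installed version:

```shell
# Hypothetical: serve an AWQ INT4 build tensor-parallel across two GPUs.
vllm serve Qwen/Qwen3.6-27B-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

The same command with the FP8 checkpoint would swap the model name and drop `--quantization awq`; on Ampere, vLLM falls back to non-native FP8 handling, which is exactly the kernel-support question the thread raises.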
// ANALYSIS
This is less a model launch story than a reminder that “same size on Hugging Face” does not mean “same runtime behavior” in vLLM.
- BF16-INT4 usually means INT4 weight quantization with BF16 activations or remaining tensors, closer to W4A16 than full low-precision FP8 execution
- Official FP8 is likely the most trustworthy quality target because Qwen says its fine-grained FP8 metrics are nearly identical to the base model
- RTX 3090-class Ampere cards lack the clean native FP8 path of newer datacenter GPUs, so AWQ/GPTQ INT4 kernels may be more practical even when they lose more accuracy
- The 6-bit AWQ build is a middle option, but its value depends on whether vLLM has efficient kernels for that exact compressed format
- For dual 3090s, the real benchmark is not file size but max context, tokens/sec, and whether speculative decoding or vision components push VRAM over the edge
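The VRAM point above is easy to sanity-check with back-of-the-envelope arithmetic. This is a weights-only estimate that deliberately ignores KV cache, activations, and CUDA overhead, which is precisely where the max-context concern comes from:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

# 27B parameters at the three precisions discussed in the thread.
for name, bits in [("FP8", 8), ("AWQ 6-bit", 6), ("AWQ INT4", 4)]:
    print(f"{name:>9}: ~{weight_gb(27, bits):.1f} GB weights")
```

Dual RTX 3090s offer ~48 GB total, so FP8 weights (~27 GB) leave roughly 21 GB for KV cache and runtime overhead, while INT4 (~13.5 GB) leaves far more headroom for long context or a vision tower.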
// TAGS
qwen3.6-27b · llm · inference · gpu · self-hosted · open-weights · vllm
DISCOVERED
4h ago
2026-04-23
PUBLISHED
6h ago
2026-04-23
RELEVANCE
8/10
AUTHOR
Blues520