Qwen3.6-35B-A3B benchmark: FP8 + MTP tops AWQ Q4
A local benchmark on Qwen3.6-35B-A3B found FP8 + MTP outperforming AWQ Q4 across serial and concurrent decode, with better latency at higher concurrency. The result suggests weight quantization alone is not a reliable proxy for real serving speed.
The interesting part here is that the serving stack matters as much as the weight format. Once MTP and other runtime optimizations enter the picture, a “heavier” precision setup can still beat a lower-bit quantized one.
- Serial decode came out at 110 tok/s for FP8 + MTP versus 91.8 tok/s for AWQ Q4
- At concurrency 4, FP8 + MTP cleared 400+ tok/s while Q4 landed at 248 tok/s
- At concurrency 8, FP8 + MTP hit 484 tok/s versus 250 tok/s for Q4
- p90 latency at concurrency 8 improved from about 5.9s to about 3.4s
- The comparison is not perfectly apples-to-apples because the Q4 setup lacked expert parallelism (EP) and MTP, which likely explains a lot of the gap
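The reported figures can be reduced to speedup and per-stream throughput with a few lines of arithmetic; the numbers below are copied from the post, and the tabulation itself is just an illustrative sketch:

```python
# Reported decode throughput (tok/s): concurrency -> (FP8 + MTP, AWQ Q4).
results = {1: (110.0, 91.8), 4: (400.0, 248.0), 8: (484.0, 250.0)}

for conc, (fp8, q4) in sorted(results.items()):
    speedup = fp8 / q4
    # Per-stream throughput shows how each setup scales as load grows:
    # Q4 barely gains from concurrency 4 to 8, while FP8 + MTP keeps climbing.
    print(f"concurrency {conc}: {speedup:.2f}x speedup, "
          f"per-stream {fp8 / conc:.1f} vs {q4 / conc:.1f} tok/s")
```

The speedup widening from roughly 1.2x serially to roughly 1.9x at concurrency 8 is consistent with the post's point that the gap comes from the serving stack, not the weight format alone.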
Discovered: 2026-05-08 · Published: 2026-05-08 · Author: Motor_Match_621