OPEN_SOURCE
REDDIT · 6d ago · BENCHMARK RESULT
Qwen 3.5 27B tops Gemma 4 on MI50
New benchmarks on dual AMD MI50 GPUs reveal a significant throughput advantage for Qwen 3.5 27B, which reaches 39.08 tok/s versus Gemma 4 31B’s 18.77 tok/s. Run on a specialized vLLM fork, the results highlight the effectiveness of Qwen’s 5-token speculative decoding for local inference on legacy hardware.
// ANALYSIS
Qwen 3.5’s Multi-Token Prediction (MTP) architecture is a game-changer for throughput, though the "reasoning tax" remains a factor.
- Qwen 3.5 27B achieves nearly double the output throughput of Gemma 4 31B, bolstered by a 5-token speculative decoding step with an 89.7% acceptance rate.
- Gemma 4 31B shows substantially higher Time to First Token (43s vs 25s), suggesting less efficient prompt processing on the specific vllm-gfx906-mobydick stack.
- The specialized "mobydick" fork is the key enabler for legacy gfx906 cards, using hybrid FP32/FP16 activations to bypass the lack of native BF16 support.
- Despite the speed, Qwen’s integrated reasoning can lead to high token counts per request; for simple tasks, Gemma’s lower verbosity may still result in faster wall-clock completion.
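As a rough back-of-envelope, the benefit of the 5-token speculative step can be modeled with the standard speculative-decoding expectation. This is an illustrative sketch, not the benchmark's code: it assumes each drafted token is accepted independently with the same probability, which the reported aggregate 89.7% figure may or may not reflect.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass when
    drafting k tokens, each accepted independently with probability p.
    Geometric-series sum p^0 + p^1 + ... + p^k: the accepted draft run
    plus the one token the target model always contributes."""
    if p >= 1.0:
        return float(k + 1)
    return (1 - p ** (k + 1)) / (1 - p)

if __name__ == "__main__":
    # Plugging in the reported 5-token draft and 89.7% acceptance rate:
    print(f"~{expected_tokens_per_step(0.897, 5):.2f} tokens per pass")
```

Under these assumptions the reported numbers imply roughly 4–5 tokens emitted per full forward pass, consistent with the near-2x throughput gap observed despite Qwen being only slightly smaller than Gemma 4 31B.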
// TAGS
qwen-3.5 · gemma-4 · llm · benchmark · inference · gpu · open-weights
DISCOVERED
2026-04-06
PUBLISHED
2026-04-05
RELEVANCE
8/10
AUTHOR
ai-infos