OPEN_SOURCE ↗
REDDIT // 32d ago · BENCHMARK RESULT
GLM-4.7 Flash tops local speed test
A LocalLLaMA benchmark on a Ryzen 5 3600X and RTX 3090 found GLM-4.7 Flash dramatically faster than two quantized Qwen reasoning models, reaching 96.68 tokens/s on short prompts and 65.08 tokens/s at 32K context. The poster did not test output quality, so this is a latency and throughput result rather than a full ranking of model usefulness.
// ANALYSIS
Local inference benchmarks like this are a good reminder that model choice is often gated by waiting time before it is gated by raw capability.
- GLM-4.7 Flash posted roughly 3x the short-context throughput of the Qwen runs and far shorter thinking times, which is a big deal for agent loops and interactive coding.
- 32K TTFT was painful on every model tested, but GLM still looked materially more usable at about 31 seconds versus roughly 41 to 55 seconds for the Qwen variants.
- The two Qwen models stayed relatively close on throughput, suggesting MoE-style local setups can stay competitive on tokens per second even when first-token latency slips.
- Because the test skips answer quality entirely, the practical takeaway is not "GLM wins everything," but "GLM currently looks much better for local responsiveness on this hardware."
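The two metrics the benchmark hinges on, time-to-first-token and decode throughput, are easy to measure from any streaming generation loop. The post does not share its harness, so this is a minimal illustrative sketch; `measure_stream` and `fake_stream` are hypothetical names, with the fake generator standing in for a real llama.cpp or vLLM streaming API.

```python
import time


def measure_stream(token_iter):
    """Measure time-to-first-token and decode throughput for a token stream.

    Returns (ttft_seconds, tokens_per_second). Decode throughput is computed
    over the interval after the first token arrives, which is the convention
    that keeps prefill latency out of the tokens/s figure.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first is None:
            first = now  # first token observed: everything before this is TTFT
        count += 1
    end = time.perf_counter()

    ttft = (first - start) if first is not None else float("inf")
    decode_time = (end - first) if count > 1 else 0.0
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps


def fake_stream(n=50, delay=0.001):
    # Simulated token generator; a real harness would iterate an API's
    # streaming response instead.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft:.4f}s, throughput: {tps:.1f} tok/s")
```

Separating the two numbers matters here: the Qwen runs lost on TTFT far more than on tokens/s, and a single combined "speed" figure would have hidden that.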
// TAGS
glm-4.7-flash · qwen · llm · benchmark · inference
DISCOVERED
32d ago
2026-03-11
PUBLISHED
32d ago
2026-03-11
RELEVANCE
8/10
AUTHOR
aiko929