GLM-4.7 Flash tops local speed test
OPEN_SOURCE
REDDIT · 32d ago · BENCHMARK RESULT

A LocalLLaMA benchmark on a Ryzen 5 3600X and RTX 3090 found GLM-4.7 Flash dramatically faster than two quantized Qwen reasoning models, reaching 96.68 tokens/s on short prompts and 65.08 tokens/s at 32K context. The poster did not test output quality, so this is a latency and throughput result rather than a full ranking of model usefulness.

// ANALYSIS

Local inference benchmarks like this are a good reminder that model choice is often gated by waiting time before it is gated by raw capability.

  • GLM-4.7 Flash posted roughly 3x the short-context throughput of the Qwen runs and far shorter thinking times, which is a big deal for agent loops and interactive coding.
  • Time to first token (TTFT) at 32K context was painful on every model tested, but GLM still looked materially more usable at about 31 seconds versus roughly 41 to 55 seconds for the Qwen variants.
  • The two Qwen models stayed relatively close on throughput, suggesting MoE-style local setups can stay competitive on tokens per second even when first-token latency slips.
  • Because the test skips answer quality entirely, the practical takeaway is not “GLM wins everything,” but “GLM currently looks much better for local responsiveness on this hardware.”
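The TTFT-versus-throughput tradeoff above can be made concrete with a simple latency model: total response time is the wait for the first token plus steady decoding of the rest. This is a hedged sketch, not the poster's methodology; the linear-decode assumption and the 0.5 s short-prompt TTFT are illustrative guesses, while the throughput and 32K TTFT figures come from the post.

```python
# Sketch: estimate end-to-end latency for one response from two numbers
# the post reports: time-to-first-token (TTFT) and decode throughput.
# The linear-decode model is an assumption, not part of the benchmark.

def response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Wall-clock seconds: wait for the first token, then decode the
    remaining tokens at a steady rate."""
    return ttft_s + (output_tokens - 1) / tokens_per_s

# GLM-4.7 Flash figures from the post (throughput and 32K TTFT);
# the 0.5 s short-prompt TTFT is a hypothetical placeholder.
short = response_time(0.5, 96.68, 1000)     # short prompt
long_ctx = response_time(31.0, 65.08, 1000)  # 32K context

print(f"short prompt: {short:.1f}s, 32K context: {long_ctx:.1f}s")
print(f"TTFT share of 32K latency: {31.0 / long_ctx:.0%}")
```

Under these assumptions a 1000-token answer takes roughly 11 s at short context but over 46 s at 32K, with TTFT alone accounting for about two thirds of the long-context total, which is why first-token latency, not tokens per second, dominates the feel of long-context local use.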
// TAGS
glm-4.7-flash, qwen, llm, benchmark, inference

DISCOVERED

2026-03-11

PUBLISHED

2026-03-11

RELEVANCE

8/10

AUTHOR

aiko929