OPEN_SOURCE
REDDIT · 5h ago · BENCHMARK RESULT

Qwen3.5-35B-A3B Hits 120 tok/s on H200

This Reddit post is a sanity check from someone running Qwen3.5-35B-A3B on an H200 through vLLM with AWQ quantization, plus workarounds for an older driver/CUDA stack. They are seeing about 120 tokens per second and suspect the deployment is misconfigured. Public benchmark data for a tuned 1x H200 setup puts this model at 200+ tok/s for single-request, short-context inference, and much higher aggregate throughput under load, so 120 tok/s is below what a healthy configuration would normally deliver.
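
For context, this is a minimal sketch of how such a single-stream number can be measured with vLLM's offline Python API. The model ID is a hypothetical placeholder for whatever AWQ checkpoint is actually deployed, and the timing folds prompt prefill into the total, so treat the output as a ballpark check rather than a formal benchmark.

    # Single-stream throughput sanity check via vLLM's offline API.
    import time

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3.5-35B-A3B-AWQ",  # hypothetical repo name, swap in the real checkpoint
        quantization="awq",
        max_model_len=8192,
    )
    params = SamplingParams(temperature=0.0, max_tokens=512)
    prompt = "Explain the tradeoffs between AWQ and FP8 quantization."

    llm.generate([prompt], params)  # warmup: exclude graph capture / kernel autotuning

    start = time.perf_counter()
    out = llm.generate([prompt], params)
    elapsed = time.perf_counter() - start

    n_tokens = len(out[0].outputs[0].token_ids)
    print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")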

// ANALYSIS

Hot take: the GPU is probably not the bottleneck here; the stack is. 120 tok/s is not catastrophic, but it is low enough that I would treat it as a configuration or kernel-path problem before blaming the H200.

  • Benchmarks on 1x H200 SXM for Qwen3.5-35B-A3B FP8 show about 212.5 tok/s at 1K context for a single request, and around 223 tok/s at 8K context in a chatbot-oriented profile.
  • The same H200 benchmark scales to far higher aggregate throughput under concurrency, which means 120 tok/s on a single stream is leaving a lot on the table.
  • The user’s improvised Singularity build, driver/CUDA mismatch, and AWQ/vLLM/MTP stack are all credible places to lose performance.
  • If the context is long, prefix caching is off, or the wrong backend/kernel path is being used, 120 tok/s becomes more believable, but it still reads as suboptimal for this hardware.
  • The most useful next step is to compare against a known-good vLLM FP8 or well-validated AWQ setup on the same machine before tuning anything model-specific; the sketch after this list shows one way to run that comparison.
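
As a rough illustration of that comparison (not the poster's actual setup), the Python sketch below measures aggregate throughput at increasing batch sizes on one engine; the model ID and the prefix-caching flag are assumptions to be swapped for the real deployment's values. If aggregate tok/s climbs well past the single-stream figure, the GPU has headroom and the single-stream path is where the 120 tok/s is being lost.

    # Batched-throughput sanity check: does aggregate tok/s scale past
    # the single-stream number on the same engine?
    import time

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3.5-35B-A3B-AWQ",  # hypothetical repo name, as above
        quantization="awq",
        enable_prefix_caching=True,  # rule out a disabled prefix cache
    )
    params = SamplingParams(temperature=0.0, max_tokens=256)
    prompts = [f"Summarize topic {i} in one paragraph." for i in range(32)]

    llm.generate(prompts[:1], params)  # warmup

    for batch in (1, 8, 32):
        start = time.perf_counter()
        outs = llm.generate(prompts[:batch], params)
        elapsed = time.perf_counter() - start
        total = sum(len(o.outputs[0].token_ids) for o in outs)
        print(f"batch={batch:2d}: {total / elapsed:8.1f} aggregate tok/s")

If the batched numbers scale but the single stream stays slow, forcing a different attention backend via the VLLM_ATTENTION_BACKEND environment variable (e.g. FLASHINFER vs. FLASH_ATTN) is a quick A/B test for the wrong-kernel-path theory.
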
// TAGS
qwen3.5 · h200 · vllm · awq · llm-serving · inference · benchmark · singularity · cuda

DISCOVERED

5h ago · 2026-04-30

PUBLISHED

8h ago · 2026-04-30

RELEVANCE

8/10

AUTHOR

Theio666