
Qwen3.5-35B-A3B Hits 120 tok/s on H200

// 45d ago · BENCHMARK RESULT

This Reddit post is a sanity check from someone running Qwen3.5-35B-A3B on an H200 through vLLM with AWQ quantization and extra setup work around an older driver/CUDA stack. They are seeing about 120 tokens per second and suspect the deployment is misconfigured. Public benchmark data for a tuned 1x H200 setup puts this model closer to roughly 200+ tok/s for single-request, short-context inference and much higher aggregate throughput under load, so 120 tok/s is below what you would normally expect from a healthy configuration.
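To put the gap in numbers, here is a minimal sketch of the arithmetic. It assumes 200 tok/s as the reference, the lower bound of the "roughly 200+" figure above; the helper name `fraction_of_reference` is hypothetical. The reference numbers come from a tuned setup rather than the poster's AWQ stack, so treat the ratio as a rough indicator, not a like-for-like comparison.

```python
def fraction_of_reference(observed_tok_s: float, reference_tok_s: float) -> float:
    """Return observed throughput as a fraction of a reference throughput."""
    if reference_tok_s <= 0:
        raise ValueError("reference_tok_s must be positive")
    return observed_tok_s / reference_tok_s


observed = 120.0   # tok/s reported in the Reddit post
reference = 200.0  # assumption: lower bound of the "roughly 200+ tok/s" figure

ratio = fraction_of_reference(observed, reference)
print(f"{ratio:.0%} of reference, shortfall {1 - ratio:.0%}")
```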

// ANALYSIS

Hot take: the GPU is probably not the bottleneck here; the stack is. 120 tok/s is not catastrophic, but it is low enough that I would treat it as a configuration or kernel-path problem before blaming the H200.

  • Benchmarks on 1x H200 SXM for Qwen3.5-35B-A3B FP8 show about 212.5 tok/s at 1K context for a single request, and around 223 tok/s at 8K context in a chatbot-oriented profile.
  • The same H200 benchmark scales to far higher aggregate throughput under concurrency, which means 120 tok/s on a single stream is leaving a lot on the table.
  • The user’s improvised Singularity build, driver/CUDA mismatch, and AWQ/vLLM/MTP stack are all credible places to lose performance.
  • If the context is long, prefix caching is off, or the wrong backend/kernel path is being used, 120 tok/s becomes more believable, but it still reads as suboptimal for this hardware.
  • The most useful next step is to compare against a known-good vLLM FP8 or well-validated AWQ setup on the same machine before tuning anything model-specific.
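
The same-machine comparison suggested in the last bullet needs a repeatable single-stream measurement. The sketch below is one way to get it, assuming a vLLM server exposing its OpenAI-compatible `/v1/completions` endpoint. The base URL, the served model name, the prompt, and the helper names (`request_completion`, `measure`, `run_once`) are illustrative assumptions, not details from the original post. It times a whole non-streaming request, so the result includes prefill time and only approximates decode speed when the prompt is short.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def request_completion(base_url: str, model: str, prompt: str, max_tokens: int) -> int:
    """Send one non-streaming completion request; return the generated token count."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    # The OpenAI-style response reports token counts under "usage".
    return int(body["usage"]["completion_tokens"])


def measure(send: Callable[[], int],
            clock: Callable[[], float] = time.perf_counter) -> Tuple[int, float, float]:
    """Time one call to `send`; return (tokens, seconds, tokens per second)."""
    start = clock()
    tokens = send()
    elapsed = clock() - start
    return tokens, elapsed, tokens_per_second(tokens, elapsed)


def run_once(base_url: str = "http://localhost:8000",   # assumption: local server
             model: str = "Qwen3.5-35B-A3B") -> float:  # assumption: served model name
    """Measure a single short-prompt request against a running server."""
    _, _, rate = measure(
        lambda: request_completion(base_url, model, "Explain KV caching.", 512)
    )
    return rate
```

Calling `run_once()` against the existing AWQ deployment and again against a known-good configuration on the same H200 gives two numbers that can be compared directly. Repeating each measurement a few times and discarding the first (warm-up) run makes the comparison less noisy.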
// TAGS
qwen3.5 · h200 · vllm · awq · llm-serving · inference · benchmark · singularity · cuda

DISCOVERED: 2026-04-30 (45d ago)

PUBLISHED: 2026-04-30 (45d ago)

RELEVANCE: 8/10

AUTHOR: Theio666