DGX Spark boosts multi-user agent serving
This Reddit benchmark post compares several Qwen3.6-35B-A3B serving setups on NVIDIA DGX Spark for agentic, multi-user use. The author effectively rules out Atlas after tool-calling failures, then reports stronger results from RedHatAI/Qwen3.6-35B-A3B-NVFP4 on vLLM: roughly 51 tokens per second (tps) single-stream at about 30k context and 5000 output tokens, and about 139 tps aggregate across four concurrent requests, with a 77.8% MTP draft acceptance rate.
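To put the two throughput figures side by side, here is a minimal arithmetic sketch using only the reported numbers (51 tps single-stream, 139 tps aggregate over four requests, 5000 output tokens). The function name `concurrency_summary` and its field names are illustrative, not from the post, and the calculation assumes the single-stream and four-way runs used comparable context and output lengths, which the summary does not confirm.

```python
def concurrency_summary(single_tps: float, aggregate_tps: float, streams: int) -> dict:
    """Derive per-stream throughput and scaling from reported benchmark figures."""
    per_stream_tps = aggregate_tps / streams   # what each concurrent user sees
    speedup = aggregate_tps / single_tps       # aggregate gain over one stream
    efficiency = speedup / streams             # 1.0 would be perfect linear scaling
    return {
        "per_stream_tps": per_stream_tps,
        "speedup": speedup,
        "efficiency": efficiency,
    }


def seconds_to_generate(output_tokens: int, tps: float) -> float:
    """Decode time only; ignores prefill of the ~30k-token context (assumption)."""
    return output_tokens / tps


# Figures as reported in the post (approximate values).
summary = concurrency_summary(single_tps=51.0, aggregate_tps=139.0, streams=4)

print(f"per-stream tps under 4-way load: {summary['per_stream_tps']:.2f}")  # 34.75
print(f"aggregate speedup vs single:     {summary['speedup']:.2f}x")        # 2.73x
print(f"scaling efficiency:              {summary['efficiency']:.0%}")      # 68%
print(f"5000 tokens, single stream:      {seconds_to_generate(5000, 51.0):.0f} s")
print(f"5000 tokens, one of four:        {seconds_to_generate(5000, summary['per_stream_tps']):.0f} s")
```

Read this way, each of four concurrent users would see roughly 35 tps instead of 51, while the box as a whole delivers about 2.7 times the single-stream throughput. These are derived estimates, not additional measurements from the post.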
This is a strong signal for people trying to run shared agent workloads locally: DGX Spark is viable, but the inference stack is still the real bottleneck. The key datapoint is not single-stream throughput alone; the NVFP4 setup scales materially better under four-way concurrency than the AWQ setup. For agent use, tool-calling reliability matters more than headline tps, and the author’s Atlas experience shows that a faster stack can still be unusable if function calling breaks. The posted vLLM flags are unusually informative for reproducibility, which makes this a useful benchmark post rather than anecdotal bragging. For multi-user agent services, the numbers imply that DGX Spark can support meaningful concurrent traffic, but model format, speculative decoding, context handling, and tool-parser stability will determine whether it is usable in production.
DISCOVERED
3h ago
2026-05-23
PUBLISHED
5h ago
2026-05-23
AUTHOR
totosse17
