OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE
Atlas pushes GB10 inference past 115 tok/s
Atlas, a pure Rust LLM inference engine for NVIDIA DGX Spark and GB10 systems, says its new Qwen3.5-35B container reaches roughly 115 tokens per second with speculative decoding and NVFP4 optimizations. The release matters because it positions Atlas as a faster, OpenAI-compatible alternative to stock vLLM images for local high-end inference workloads.
// ANALYSIS
Atlas is interesting because it is not just another benchmark post — it is an attempt to own the full local inference stack on DGX Spark and turn niche hardware into a serious developer platform.
- The headline claim is a 3.1x speedup over the community-standard vLLM image, which is a big enough jump to matter for anyone serving local models interactively
- Atlas is pitching operational simplicity as much as raw speed: pure Rust, no Python stack, OpenAI-compatible serving, and a container that should be runnable in minutes (see the sketch after this list)
- The roadmap broadens the story beyond one model, with Qwen3.5-122B, Nemotron, ASUS Ascent GX10, and even Strix Halo mentioned as next targets
- The biggest caveat is trust: community reaction on NVIDIA’s forum has already pushed for reproducible benchmarks and open-source code before treating Atlas as a new default
- If the team follows through on broader hardware support and a credible open-source release, Atlas could become one of the more important local inference projects around GB10-class systems
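Because Atlas advertises an OpenAI-compatible API, any standard OpenAI client should be able to talk to the container once it is running. The sketch below is illustrative only: the local port, endpoint path, and model identifier are assumptions rather than values confirmed by the post, so the Atlas container's own documentation is the place to check for the real ones.

```python
# Minimal sketch of calling an OpenAI-compatible local server such as Atlas.
# Assumptions (not confirmed by the post): endpoint http://localhost:8000/v1,
# model id "qwen3.5-35b", and that the local server ignores the API key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local Atlas endpoint
    api_key="unused",                     # local servers typically accept any key
)

resp = client.chat.completions.create(
    model="qwen3.5-35b",  # assumed identifier for the Qwen3.5-35B container
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

The point of OpenAI compatibility is exactly this: existing clients, SDKs, and tooling built against the hosted API can be repointed at the local container by changing the base URL, without rewriting application code.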
// TAGS
atlas · llm · inference · gpu · self-hosted · api
DISCOVERED
32d ago
2026-03-10
PUBLISHED
36d ago
2026-03-07
RELEVANCE
8/10
AUTHOR
Live-Possession-6726