Atlas pushes GB10 inference past 115 tok/s
OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE


The developers of Atlas, a pure-Rust LLM inference engine for NVIDIA DGX Spark and GB10 systems, say their new Qwen3.5-35B container reaches roughly 115 tokens per second using speculative decoding and NVFP4 optimizations. The release matters because it positions Atlas as a faster, OpenAI-compatible alternative to stock vLLM images for high-end local inference workloads.

// ANALYSIS

Atlas is interesting because it is not just another benchmark post — it is an attempt to own the full local inference stack on DGX Spark and turn niche hardware into a serious developer platform.

  • The headline claim is the 3.1x speedup over the community-standard vLLM image, which is a big enough jump to matter for anyone serving local models interactively
  • Atlas is pitching operational simplicity as much as raw speed: pure Rust, no Python stack, OpenAI-compatible serving, and a container that should be runnable in minutes
  • The roadmap broadens the story beyond one model, with Qwen3.5-122B, Nemotron, ASUS Ascent GX10, and even Strix Halo mentioned as next targets
  • The biggest caveat is trust: community reaction on NVIDIA’s forum has already pushed for reproducible benchmarks and open source code before treating Atlas as a new default
  • If the team follows through on broader hardware support and a credible open-source release, Atlas could become one of the more important local inference projects around GB10-class systems
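Because Atlas advertises OpenAI-compatible serving, any standard client should be able to talk to the local container. A minimal sketch of such a client, assuming a hypothetical local endpoint (`http://localhost:8000/v1`) and model id (`qwen3.5-35b`) — both placeholders, not confirmed by the post:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # hypothetical local Atlas endpoint
MODEL_ID = "qwen3.5-35b"               # hypothetical model id

def build_chat_request(prompt: str) -> dict:
    # Standard OpenAI-style /v1/chat/completions payload.
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }

def extract_reply(response: dict) -> str:
    # OpenAI-compatible responses carry the text at choices[0].message.content.
    return response["choices"][0]["message"]["content"]

def chat(prompt: str) -> str:
    # POST the request to the local server and return the model's reply.
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

If the API surface really is compatible, the same code works unchanged against vLLM or any other OpenAI-style server, which is what makes the "drop-in replacement" pitch testable.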
// TAGS
atlas · llm · inference · gpu · self-hosted · api

DISCOVERED

2026-03-10 (32d ago)

PUBLISHED

2026-03-07 (36d ago)

RELEVANCE

8/10

AUTHOR

Live-Possession-6726