Atlas: open-source Rust/CUDA inference engine for NVIDIA GB10
Atlas is an open-source LLM inference engine for NVIDIA DGX Spark and GB10 systems, built from scratch in Rust and CUDA with no PyTorch or Python runtime in the serving path. The launch emphasizes a small ~2.5 GB image, sub-2-minute cold starts, hand-tuned Blackwell kernels, and OpenAI/Anthropic-compatible serving for agentic tools. The announcement highlights roughly 111 tok/s sustained and 130 tok/s peak on Qwen3.5-35B, with similar throughput claims across larger Qwen, Gemma, and Nemotron variants.
Hot take: this is more compelling as a hardware-specific runtime than as another generic inference server, and that specialization is exactly why the benchmark claims matter.
- The main technical bet is removing the usual Python/PyTorch stack and replacing it with a lean Rust + CUDA execution path.
- The performance story looks strongest on GB10/Blackwell, where the project can exploit custom kernels for attention, MoE, GDN, and Mamba-2 rather than relying on fallback code.
- OpenAI and Anthropic API compatibility reduces adoption friction for downstream tooling like Claude Code, Cline, OpenCode, and Open WebUI.
- The biggest caveat is portability: if the roadmap does not translate well to other chips, Atlas stays a niche high-performance local inference stack instead of a broader platform.
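The practical upshot of OpenAI compatibility is that existing clients only need a base-URL swap. A minimal sketch of what that looks like, assuming Atlas exposes the conventional `/v1/chat/completions` route on localhost (the port, path, and model name here are illustrative assumptions, not taken from the announcement):

```python
import json

# Hypothetical local Atlas endpoint; the port and path follow the OpenAI
# chat-completions convention and are assumptions, not from the announcement.
ATLAS_BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request for a local server."""
    return {
        "url": f"{ATLAS_BASE_URL}/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # agentic tools typically consume streamed tokens
        }),
    }

req = chat_request("qwen3.5-35b", "Summarize this diff.")
print(req["url"])
```

Tools like Claude Code or Open WebUI would point their configured base URL at the local server in the same way, which is what makes the compatibility claim matter for adoption.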
Discovered: 2026-05-07 (4h ago) · Published: 2026-05-06 (7h ago) · Author: Live-Possession-6726