OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE
Jetson AGX Orin eyes 40B LLM load
A LocalLLaMA user asks whether a used Jetson AGX Orin 64GB Dev Kit can run 30B-40B local LLMs for up to four users with the lowest possible power draw. The target is ambitious: 8-15 tokens per second per user, which pushes the problem from “can it fit?” into “can it serve enough throughput?”
// ANALYSIS
The Orin is a strong efficiency play, but this is more a multi-user serving question than a raw memory question. It can plausibly host a heavily quantized 30B model, but four concurrent users at the requested speed is likely beyond what a 60W edge SoC can sustain in practice.
- NVIDIA lists the Jetson AGX Orin 64GB at up to 275 TOPS and 15W-60W power, so the power envelope is excellent for always-on inference.
- A 30B GGUF quant can fit in roughly 20GB at Q4_K_M, but that leaves only part of the 64GB unified memory for KV cache, runtime overhead, and concurrency.
- The real bottleneck is throughput: 4 users at 8-15 tok/s each means roughly 32-60 tok/s aggregate, which is much closer to desktop-GPU territory than embedded-edge territory.
- If the goal is lowest wattage per token, Orin is attractive; if the goal is four-person interactive serving, a used discrete GPU box will usually beat it on tokens/sec per euro.
- The thread is useful as a reality check: “efficient” hardware and “shared low-latency serving” are not the same optimization problem.
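The memory and throughput arithmetic above can be sketched as a back-of-envelope budget. The model dimensions below (32B parameters, 64 layers, 8 KV heads, 128 head dim, 8k context, ~4.8 bits/weight for Q4_K_M) are illustrative assumptions, not figures from the thread:

```python
# Rough serving budget for a 30B-class model on a 64GB Jetson AGX Orin.
# All model dimensions are assumptions (roughly a 32B GQA model), not measurements.

GiB = 1024**3

# Weights: Q4_K_M averages ~4.8 bits per weight (assumed).
params = 32e9
weights_gib = params * 4.8 / 8 / GiB  # ~18 GiB, consistent with the ~20GB cited

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * fp16 bytes.
layers, kv_heads, head_dim, bytes_per_val = 64, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # 256 KiB/token

users, ctx = 4, 8192  # four concurrent users at 8k context each (assumed)
kv_gib = users * ctx * kv_per_token / GiB  # 8.0 GiB on top of the weights

print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv_gib:.1f} GiB")

# Throughput: per-user targets aggregate linearly across concurrent users.
agg_lo, agg_hi = users * 8, users * 15
print(f"aggregate target: {agg_lo}-{agg_hi} tok/s")
```

So the 64GB capacity is comfortable even with multi-user KV cache, which is exactly why the card frames the open question as sustained aggregate tok/s rather than fit.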
// TAGS
nvidia-jetson-agx-orin-64gb-developer-kit · llm · inference · gpu · edge-ai · self-hosted
DISCOVERED
3h ago
2026-04-16
PUBLISHED
4h ago
2026-04-16
RELEVANCE
6/10
AUTHOR
Jezel123