OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE
Jetson AGX Orin eyes 40B LLM load
A LocalLLaMA user asks whether a used Jetson AGX Orin 64GB Dev Kit can run 30B-40B local LLMs for up to four users with the lowest possible power draw. The target is ambitious: 8-15 tokens per second per user, which pushes the problem from “can it fit?” into “can it serve enough throughput?”
// ANALYSIS
The Orin is a strong efficiency play, but this is more a multi-user serving question than a raw memory question. It can plausibly host a heavily quantized 30B model, but four concurrent users at the requested speed is likely beyond what a 60W edge SoC can sustain in practice.
- NVIDIA lists the Jetson AGX Orin 64GB at up to 275 TOPS and 15W-60W power, so the power envelope is excellent for always-on inference.
- A 30B GGUF quant can fit in roughly 20GB at Q4_K_M, but that leaves only part of the 64GB unified memory for KV cache, runtime overhead, and concurrency.
- The real bottleneck is throughput: 4 users at 8-15 tok/s each means roughly 32-60 tok/s aggregate, which is much closer to desktop-GPU territory than embedded-edge territory.
- If the goal is lowest wattage per token, Orin is attractive; if the goal is four-person interactive serving, a used discrete GPU box will usually beat it on tokens/sec per euro.
- The thread is useful as a reality check: “efficient” hardware and “shared low-latency serving” are not the same optimization problem.
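The memory and throughput arithmetic above can be sketched as a back-of-envelope budget. The model dimensions below (32B parameters, 64 layers, 8 KV heads, 128 head dim, 8k context, ~4.8 bits/weight for Q4_K_M) are illustrative assumptions, not figures from the thread:

```python
# Rough serving budget for a 30B-class model on a 64GB Jetson AGX Orin.
# All model dimensions are assumptions (roughly a 32B GQA model), not measurements.

GiB = 1024**3

# Weights: Q4_K_M averages ~4.8 bits per weight (assumed).
params = 32e9
weights_gib = params * 4.8 / 8 / GiB  # ~18 GiB, consistent with the ~20GB cited

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * fp16 bytes.
layers, kv_heads, head_dim, bytes_per_val = 64, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # 256 KiB/token

users, ctx = 4, 8192  # four concurrent users at 8k context each (assumed)
kv_gib = users * ctx * kv_per_token / GiB  # 8.0 GiB on top of the weights

print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv_gib:.1f} GiB")

# Throughput: per-user targets aggregate linearly across concurrent users.
agg_lo, agg_hi = users * 8, users * 15
print(f"aggregate target: {agg_lo}-{agg_hi} tok/s")
```

So the 64GB capacity is comfortable even with multi-user KV cache, which is exactly why the card frames the open question as sustained aggregate tok/s rather than fit.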
// TAGS
nvidia-jetson-agx-orin-64gb-developer-kit · llm · inference · gpu · edge-ai · self-hosted
DISCOVERED
3h ago
2026-04-16
PUBLISHED
4h ago
2026-04-16
RELEVANCE
6/10
AUTHOR
Jezel123