Qwen3.5-9B hits 4.6 t/s on Jetson Orin Nano Super
Community benchmarks of Qwen3.5-9B (Q3_K_M) on the NVIDIA Jetson Orin Nano Super show a throughput of 4.6 tokens/s with llama.cpp. The result highlights the memory-bandwidth bottleneck of entry-level edge hardware when running 9B-parameter models without a specialized inference stack.
While 4.6 t/s is a usable baseline for local LLM tasks, the hardware is capable of significantly higher throughput with deeper optimization.
- Memory bandwidth (102 GB/s) is the primary ceiling: each generated token has to stream the full set of quantized weights from memory, which caps a 9B model at a theoretical maximum of ~20 t/s
- Standard llama.cpp builds often underutilize the Jetson's Tensor Cores; switching to MLC LLM or TensorRT-LLM could potentially double or triple these speeds
- 8 GB of unified memory is tight for 9B models, requiring aggressive quantization (Q3/Q4) to leave enough room for the KV cache
- Enabling "Super Mode" via nvpmodel -m 0 and locking clocks (for example with jetson_clocks) is necessary for consistent performance at this scale; mode IDs can vary by module and JetPack release, and nvpmodel -q reports the active mode
- Developers who need faster inference should target the 2B variant, which can reach 15-20 t/s on the same hardware
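The bandwidth ceiling and the memory squeeze in the list above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 102 GB/s and 8 GB figures come from the post, while the ~3.9 bits per weight for Q3_K_M is an assumed average, and the helper names `weight_gb` and `decode_ceiling_tps` are made up for this example. It also ignores KV-cache and activation traffic, so a real ceiling would sit somewhat lower.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams all weights from memory once."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 102.0  # from the post
UNIFIED_MEM_GB = 8.0    # from the post
MEASURED_TPS = 4.6      # from the post
BITS_PER_WEIGHT = 3.9   # assumption: rough average for Q3_K_M

w9 = weight_gb(9.0, BITS_PER_WEIGHT)
ceiling = decode_ceiling_tps(BANDWIDTH_GB_S, w9)
headroom = UNIFIED_MEM_GB - w9  # shared by KV cache, OS, and runtime

print(f"9B weights at Q3_K_M: {w9:.2f} GB")
print(f"bandwidth-bound ceiling: {ceiling:.1f} t/s")
print(f"measured share of ceiling: {MEASURED_TPS / ceiling:.0%}")
print(f"memory left after weights: {headroom:.2f} GB")
```

Under these assumptions the ceiling lands in the low twenties of tokens per second, in line with the ~20 t/s quoted above, and the measured 4.6 t/s is roughly a fifth of it. That gap is what leaves room for the optimizations the list describes.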
DISCOVERED: 2026-03-08
PUBLISHED: 2026-03-06
AUTHOR: Otherwise-Sir7359