LocalLLaMA thread weighs cheap LLM rigs
An r/LocalLLaMA post asks what hardware beyond a single RTX 3090 can realistically run quantized 70B-120B models at 5-10 tok/s. The short answer from commenters is that more 3090s remain the value play, while RAM offload helps capacity more than speed.
This is mostly a bandwidth problem disguised as a budget problem: if you want 70B-class models to feel fast, you need memory bandwidth, not just more gigabytes.
- NVIDIA’s RTX 3090 gives you 24GB of GDDR6X on a 384-bit bus (about 936 GB/s), which is why it’s still the hobbyist default for local inference
- NVIDIA’s A100/H100 lines are built around HBM and much higher bandwidth, with H100 advertised at roughly 3 TB/s and H100 NVL explicitly pitched for Llama 2 70B inference
- Offloading weights to system RAM can make a model fit, but that is a capacity hack, not a speed hack; the PCIe/DDR gap will usually crush token rate once the GPU is waiting on host memory
- The practical “cheap hardware adventure” is usually more used 3090s, used datacenter cards, or smaller quantizations rather than a clever RAM trick
- The thread’s answer is blunt but accurate: hitting 5-10 tok/s on 70-120B models realistically takes something closer to HBM-class bandwidth than consumer-board ingenuity
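The bandwidth argument can be made concrete with back-of-the-envelope arithmetic. The sketch below is not from the thread; it assumes a dense model whose weights are all read once per generated token, so token rate is bounded by memory bandwidth divided by model size. It ignores KV cache, compute time, and PCIe transfer overhead, so real numbers will be lower. The 936 GB/s figure is the RTX 3090's spec-sheet bandwidth; the 51.2 GB/s host figure assumes dual-channel DDR4-3200 and is purely illustrative. All function names are hypothetical.

```python
# Rough upper bound on decode speed for a dense model.
# Assumption: every weight is read once per generated token,
# so tokens/s <= memory bandwidth / bytes of weights.

RTX_3090_GB_S = 936.0   # spec-sheet GDDR6X bandwidth
HOST_DDR_GB_S = 51.2    # ASSUMED: dual-channel DDR4-3200, theoretical peak


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def bound_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-only ceiling on tokens per second."""
    return bandwidth_gb_s / model_gb


def offload_tok_s(model_gb: float, gpu_resident_gb: float,
                  gpu_bw: float, host_bw: float) -> float:
    """Ceiling when part of the model is read from host RAM each token."""
    gpu_gb = min(model_gb, gpu_resident_gb)
    host_gb = model_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw + host_gb / host_bw
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    m70 = weight_gb(70, 4)  # 70B parameters at 4 bits per weight
    print(f"70B @ 4-bit: {m70:.1f} GB")
    print(f"all in VRAM at 3090 bandwidth: {bound_tok_s(m70, RTX_3090_GB_S):.1f} tok/s ceiling")
    print(f"24 GB in VRAM, rest in host RAM: "
          f"{offload_tok_s(m70, 24, RTX_3090_GB_S, HOST_DDR_GB_S):.1f} tok/s ceiling")
```

With these assumed numbers, the roughly 11 GB left in host RAM accounts for most of the per-token time even though it is under a third of the model, which is the capacity-versus-speed gap the list describes.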
DISCOVERED: 2026-04-07
PUBLISHED: 2026-04-07
AUTHOR: Clean_Archer8374

