OPEN_SOURCE
REDDIT · 4d ago · INFRASTRUCTURE
LocalLLaMA thread weighs cheap LLM rigs
A r/LocalLLaMA post asks what hardware can realistically push quantized 70B-120B models at 5-10 tok/s beyond a single RTX 3090. The short answer from commenters is that more 3090s remain the value play, while RAM offload helps capacity more than speed.
// ANALYSIS
This is mostly a bandwidth problem disguised as a budget problem: if you want 70B-class models to feel fast, you need memory bandwidth, not just more gigabytes.
- NVIDIA’s RTX 3090 gives you 24GB of GDDR6X on a 384-bit bus (~936 GB/s), which is why it’s still the hobbyist default for local inference
- NVIDIA’s A100/H100 lines are built around HBM and much higher bandwidth, with H100 advertising ~3 TB/s and H100 NVL explicitly pitched for Llama 2 70B inference
- Offloading weights to system RAM can make a model fit, but that is a capacity hack, not a speed hack; the PCIe/DDR bandwidth gap will usually crush token rate once the GPU is waiting on host memory
- The practical “cheap hardware adventure” is usually more used 3090s, used datacenter cards, or smaller quantizations, rather than a clever RAM trick
- The thread’s answer is blunt but accurate: for 5-10 tok/s on 70-120B, the realistic ceiling is closer to HBM-class hardware than consumer-board ingenuity
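The bandwidth argument above can be sanity-checked with a back-of-envelope model: during decode, each token requires streaming roughly the full quantized weight footprint through memory, so token rate is approximately bandwidth divided by weight size. A minimal sketch (the function name is made up here; the ~936 GB/s GDDR6X and ~90 GB/s dual-channel DDR5 figures are typical published specs, and the model ignores KV cache, activations, and multi-GPU overhead):

```python
def est_tok_per_s(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Rough decode speed: memory bandwidth / bytes streamed per token."""
    weight_gb = params_b * bits_per_weight / 8  # GB of weights read per token
    return bandwidth_gb_s / weight_gb

# 70B at 4-bit quantization = 35 GB of weights.
# On a 3090's GDDR6X (~936 GB/s) the math allows ~27 tok/s, but 35 GB
# does not fit in 24 GB of VRAM, hence multi-GPU setups:
print(round(est_tok_per_s(70, 4, 936), 1))

# The same weights streamed over dual-channel DDR5 (~90 GB/s) cap out
# around ~2.6 tok/s, below the thread's 5-10 tok/s target:
print(round(est_tok_per_s(70, 4, 90), 1))
```

This is why RAM offload helps capacity but not speed: the moment host memory is in the critical path, its much lower bandwidth sets the ceiling, regardless of how fast the GPU is.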
// TAGS
local-llama · llm · gpu · inference · self-hosted
DISCOVERED
2026-04-07 (4d ago)
PUBLISHED
2026-04-07 (4d ago)
RELEVANCE
7/10
AUTHOR
Clean_Archer8374