LocalLLaMA thread weighs cheap LLM rigs
OPEN_SOURCE
REDDIT · 4d ago · INFRASTRUCTURE


A r/LocalLLaMA post asks what hardware, beyond a single RTX 3090, can realistically run quantized 70B-120B models at 5-10 tok/s. The short answer from commenters is that more 3090s remain the value play, while offloading to system RAM adds capacity, not speed.

// ANALYSIS

This is mostly a bandwidth problem disguised as a budget problem: if you want 70B-class models to feel fast, you need memory bandwidth, not just more gigabytes.
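A back-of-envelope check makes this concrete (a sketch with assumed figures, not a benchmark): single-stream decode is roughly bandwidth-bound, because every weight must be streamed from memory once per generated token, so tok/s tops out near memory bandwidth divided by the model's size in bytes.

```python
# Rough decode-speed ceiling: tok/s ≈ memory bandwidth / bytes of weights read per token.
# All figures below are illustrative assumptions for a dense model with weights in VRAM.
def est_tok_per_s(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec when weight streaming saturates memory bandwidth."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8  # every weight read once per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A 70B model at 4-bit is ~35 GB of weights, so it needs two 3090s (24GB each);
# with layers split across the cards and streamed in sequence at ~936 GB/s each,
# the theoretical ceiling is well above the thread's 5-10 tok/s target.
print(f"{est_tok_per_s(70, 4, 936):.1f} tok/s")
```

Real-world rates land below this ceiling (activations, KV cache, and kernel overhead all cost bandwidth), but the scaling holds: halving bits-per-weight or doubling bandwidth roughly doubles the ceiling.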

  • NVIDIA’s RTX 3090 gives you 24GB of GDDR6X on a 384-bit bus, which is why it’s still the hobbyist default for local inference
  • NVIDIA’s A100/H100 lines are built around HBM and much higher bandwidth, with H100 advertising 3 TB/s and H100 NVL explicitly pitched for Llama 2 70B inference
  • Offloading weights to system RAM can make a model fit, but that is a capacity hack, not a speed hack; the PCIe/DDR gap will usually crush token rate once the GPU is waiting on host memory
  • The practical “cheap hardware adventure” is usually more used 3090s, used datacenter cards, or smaller quantizations rather than a clever RAM trick
  • The thread’s answer is blunt but accurate: sustaining 5-10 tok/s on 70-120B models points toward HBM-class hardware, not consumer-board ingenuity
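The offload penalty described above can be quantified with the same streaming model (assumed numbers again): per-token time is the time to stream the VRAM-resident fraction plus the time to stream the host-resident fraction, so effective bandwidth is a weighted harmonic mean and collapses toward the slow link.

```python
# Effective streaming bandwidth when only part of the weights live in VRAM.
# Figures are illustrative: ~936 GB/s GDDR6X vs. tens of GB/s for host DDR/PCIe paths.
def effective_bandwidth(gpu_frac: float, gpu_gb_s: float, host_gb_s: float) -> float:
    """Weighted harmonic mean: per-token streaming time adds across fast and slow pools."""
    return 1.0 / (gpu_frac / gpu_gb_s + (1.0 - gpu_frac) / host_gb_s)

# Offloading even 40% of the weights to host memory drags a 3090-class 936 GB/s
# down by nearly an order of magnitude, which is why offload helps fit, not speed.
print(f"{effective_bandwidth(0.6, 936, 60):.0f} GB/s")
```

The asymmetry is the whole story: the slow pool dominates total time almost immediately, so the token rate of a partially offloaded model tracks host bandwidth far more than VRAM bandwidth.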
// TAGS
local-llama · llm · gpu · inference · self-hosted

DISCOVERED

4d ago

2026-04-07

PUBLISHED

4d ago

2026-04-07

RELEVANCE

7/10

AUTHOR

Clean_Archer8374