llama.cpp benchmark sparks 96GB RAM debate
OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT

The post benchmarks Qwen3.5-35B-A3B in llama.cpp on a Ryzen 7 7700 with 32GB DDR5 and an RTX 5060 Ti 16GB, then asks whether moving to 96GB system RAM would make larger sparse-MoE models worth the cost. The real question is less about raw speed and more about whether extra memory unlocks meaningfully better local models without hitting SSD-bound inference.
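One way to frame "does 96GB unlock better models" is a back-of-envelope check on whether a quantized model's weights stay resident in RAM. The function and numbers below are a hedged sketch, not measurements: the bytes-per-weight figure is an assumed ~Q4-style quantization average, and the headroom figure is an assumed allowance for KV cache and the OS.

```python
# Back-of-envelope: do a model's quantized weights fit in system RAM with
# headroom left for KV cache and the OS? All constants are assumptions.

def fits_in_ram(total_params: float, ram_gb: float,
                bytes_per_weight: float = 0.55,  # ~Q4-class quant incl. scales (assumption)
                headroom_gb: float = 8.0         # KV cache + OS allowance (assumption)
                ) -> bool:
    weights_gb = total_params * bytes_per_weight / 1e9
    return weights_gb + headroom_gb <= ram_gb

# A hypothetical 100B-parameter sparse MoE (~55GB of Q4 weights):
print(fits_in_ram(100e9, 32))   # False: spills to SSD, inference UX breaks
print(fits_in_ram(100e9, 96))   # True: stays resident, stays usable
```

The binary nature of this check is the point: RAM either keeps the whole model resident or it doesn't, and crossing that line matters far more than a few GB/s of extra bandwidth.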

// ANALYSIS

Hot take: the upgrade is useful if the goal is to explore bigger local MoE models, but the 50 t/s extrapolation is optimistic and 100B-class models usually buy more breadth and consistency than a dramatic jump in intelligence.

  • The 50 t/s baseline comes from a 3B-active MoE case; scaling that linearly to 10B active parameters ignores cache pressure, routing overhead, and memory bandwidth limits.
  • 96GB matters most when it keeps the full quantized model resident in RAM and avoids paging or disk involvement, which is what usually breaks local inference UX.
  • For many users, 35B-class models are still the sweet spot; 100B-class models improve world knowledge and robustness, but the gains diminish quickly relative to cost, heat, and power.
  • Modern sparse MoE models are the right place to spend RAM first, because they can feel much larger than their active parameter count suggests while staying locally runnable.
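The first bullet's point about linear extrapolation can be sketched with a simple roofline-style bound: on CPU, every generated token must stream the active weights from RAM once, so decode speed is capped by bandwidth divided by active-weight bytes. The bandwidth and quantization figures below are illustrative assumptions, not benchmarks of the poster's hardware.

```python
# Rough upper bound on CPU decode speed for a sparse MoE model:
# each token streams the active expert weights from RAM once, so
# throughput <= memory_bandwidth / active_weight_bytes.
# Constants are illustrative assumptions, not measured values.

def max_tokens_per_sec(active_params: float,
                       bytes_per_weight: float = 0.55,  # ~Q4-class quant (assumption)
                       mem_bandwidth_gbs: float = 60.0  # dual-channel DDR5-ish (assumption)
                       ) -> float:
    bytes_per_token = active_params * bytes_per_weight
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

print(round(max_tokens_per_sec(3e9), 1))    # 3B-active ceiling, ~36 t/s
print(round(max_tokens_per_sec(10e9), 1))   # 10B-active ceiling, ~11 t/s
```

Even this optimistic ceiling falls roughly in proportion to active parameters, before counting routing overhead and cache pressure, which is why a linear 50 t/s extrapolation to a 10B-active model overshoots.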
// TAGS
local-llm · llama-cpp · qwen3.5 · moe · ram-upgrade · benchmarking · cpu-offload · gpu-offload · inference

DISCOVERED

3h ago

2026-04-25

PUBLISHED

6h ago

2026-04-24

RELEVANCE

8/10

AUTHOR

UncleRedz