OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT
llama.cpp benchmark sparks 96GB RAM debate
The post benchmarks Qwen3.5-35B-A3B in llama.cpp on a Ryzen 7 7700 with 32GB DDR5 and an RTX 5060 Ti 16GB, then asks whether moving to 96GB system RAM would make larger sparse-MoE models worth the cost. The real question is less about raw speed and more about whether extra memory unlocks meaningfully better local models without hitting SSD-bound inference.
// ANALYSIS
Hot take: the upgrade is useful if the goal is to explore bigger local MoE models, but the 50 t/s extrapolation is optimistic and 100B-class models usually buy more breadth and consistency than a dramatic jump in intelligence.
- The 50 t/s baseline comes from a 3B-active MoE case; scaling that linearly to 10B active parameters ignores cache pressure, routing overhead, and memory bandwidth limits.
- 96GB matters most when it keeps the full quantized model resident in RAM and avoids paging or disk involvement, which is what usually breaks local inference UX.
- For many users, 35B-class models are still the sweet spot; 100B-class models improve world knowledge and robustness, but the gains diminish quickly relative to cost, heat, and power.
- Modern sparse MoE models are the right place to spend RAM first, because they can feel much larger than their active parameter count suggests while staying locally runnable.
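The bandwidth argument above can be sketched numerically. During decode, each generated token must stream the active-parameter weights from memory, so throughput is roughly usable bandwidth divided by bytes touched per token. The figures below are illustrative assumptions (dual-channel DDR5-5600 peak, ~Q4 quantization overhead, 60% bandwidth efficiency), not measurements from the post:

```python
# Rough, bandwidth-bound ceiling on decode throughput for a sparse-MoE
# model running from system RAM. Assumption: weight streaming dominates,
# so t/s ~= usable bandwidth / bytes of active weights per token.

def tokens_per_second(active_params_b: float,
                      bytes_per_weight: float,
                      bandwidth_gbs: float,
                      efficiency: float = 0.6) -> float:
    """Upper bound on decode speed when weight streaming dominates."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return (bandwidth_gbs * 1e9 * efficiency) / bytes_per_token

# Assumed dual-channel DDR5-5600: ~89.6 GB/s theoretical peak.
DDR5_GBS = 89.6

# ~3B active params at ~Q4 (assumed ~0.55 bytes/weight incl. overhead),
# vs. a hypothetical 10B-active model at the same quantization.
small = tokens_per_second(3.0, 0.55, DDR5_GBS)
big = tokens_per_second(10.0, 0.55, DDR5_GBS)

print(f"3B-active ceiling:  ~{small:.0f} t/s")
print(f"10B-active ceiling: ~{big:.0f} t/s")
```

Under these assumptions the 10B-active ceiling lands near 10 t/s on CPU memory bandwidth alone, well short of a linearly extrapolated 50 t/s; GPU offload of shared layers raises the number, but the DRAM-resident experts remain the bottleneck.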
// TAGS
local-llm · llama-cpp · qwen3.5 · moe · ram-upgrade · benchmarking · cpu-offload · gpu-offload · inference
DISCOVERED
3h ago
2026-04-25
PUBLISHED
6h ago
2026-04-24
RELEVANCE
8/10
AUTHOR
UncleRedz