OPEN_SOURCE
REDDIT // 3h ago // INFRASTRUCTURE
Optane PMem build runs 1 trillion parameter LLM locally
A specialized local build featuring 768GB of secondhand Intel Optane Persistent Memory and an RTX 3060 has successfully run the 1.04 trillion parameter Kimi K2.5 model at roughly 5 tokens per second. By leveraging the sparse Mixture-of-Experts architecture and llama.cpp's hybrid offloading, the project achieves frontier-class inference on a hardware budget far below that of traditional GPU-heavy setups.
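The hybrid offloading described can be approximated with llama.cpp's tensor-override flag, which pins the large MoE expert tensors to system memory (DRAM plus PMem here) while dense layers stay on the GPU. This is a sketch only; the model filename, layer count, and context size are illustrative assumptions, not the builder's actual configuration:

```shell
# Illustrative sketch, not the builder's exact command.
# Dense/attention layers go to the RTX 3060; the regex pins MoE expert
# tensors (".ffn_.*_exps.") to CPU-addressable memory, i.e. DRAM + PMem.
./llama-server \
  -m kimi-k2.5-UD-Q2_K_XL.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 8192
```

Because only a few experts activate per token, the expert weights tolerate slower storage than the dense layers, which is what makes PMem a workable tier here.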
// ANALYSIS
MoE architectures combined with tiered memory are making 1T+ parameter models viable for hobbyists, effectively bypassing the "VRAM tax" for large-scale reasoning.
- Intel's discontinued PMem modules provide a high-bandwidth, low-latency middle ground between DRAM and SSDs, ideal for sparse expert offloading.
- This build demonstrates that memory capacity, not just FLOPs, is the primary hurdle for local frontier LLM deployment.
- Software optimizations like Unsloth's dynamic quants are essential for fitting 1T models into sub-1TB memory footprints.
- The 5 t/s performance milestone proves that expensive H100 clusters aren't the only way to achieve acceptable inference speeds for research.
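The points above reduce to simple arithmetic: a sparse MoE reads only its active experts per token, so decode speed is bounded by memory bandwidth divided by active bytes per token, while the quantized total footprint must fit in capacity. A back-of-envelope sketch; every number below is an illustrative assumption, not a measured figure from the build:

```python
# Back-of-envelope bounds for memory-bandwidth-limited MoE inference.
# All concrete numbers are illustrative assumptions, not build measurements.

def tokens_per_sec(active_params_b: float, bits_per_weight: float,
                   bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

def footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-memory size of the quantized weights."""
    return total_params_b * bits_per_weight / 8

# Assumed: ~32B active params/token, ~4-bit effective quant,
# ~80 GB/s aggregate read bandwidth across the tiered memory.
print(f"{tokens_per_sec(32, 4, 80):.1f} tok/s")      # on the order of the reported 5 t/s
# Assumed: 1040B total params at ~2.7 bits/weight average (dynamic quant).
print(f"{footprint_gb(1040, 2.7):.0f} GB")           # fits under the 768 GB of PMem
```

The takeaway matches the bullets: capacity gates whether the model loads at all, and effective bandwidth over the active experts, not compute, gates tokens per second.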
// TAGS
llm · inference · gpu · self-hosted · intel-optane · kimi-k2.5 · unsloth · moe
DISCOVERED
3h ago
2026-04-15
PUBLISHED
3h ago
2026-04-15
RELEVANCE
8/10
AUTHOR
APFrisco