Optane PMem build runs 1 trillion parameter LLM locally
A specialized local build pairing 768GB of secondhand Intel Optane Persistent Memory with an RTX 3060 has run the 1.04-trillion-parameter Kimi K2.5 model at roughly 5 tokens per second. By leveraging the model's sparse Mixture-of-Experts architecture and llama.cpp's hybrid CPU/GPU offloading, the project achieves frontier-class inference on a hardware budget far below traditional GPU-heavy alternatives.
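To see why sparsity matters here, consider a rough back-of-the-envelope bound: if decoding is limited by how fast weights can be read from memory, tokens per second cannot exceed memory bandwidth divided by the bytes touched per token. The sketch below is illustrative only. The active-parameter count, bits per weight, and bandwidth figure are assumed placeholders, not measurements from this build.

```python
def tokens_per_second_bound(active_params: float,
                            bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when weight reads dominate.

    Every generated token must read each active weight once, so
    the ceiling is bandwidth / bytes read per token.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# All three inputs below are assumptions chosen for illustration.
TOTAL_PARAMS = 1.04e12      # total parameters, from the summary above
ACTIVE_PARAMS = 32e9        # assumed parameters activated per token (MoE)
BITS_PER_WEIGHT = 4.0       # assumed average quantization width
BANDWIDTH_GB_S = 40.0       # assumed effective memory bandwidth

sparse_bound = tokens_per_second_bound(ACTIVE_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
dense_bound = tokens_per_second_bound(TOTAL_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"sparse (MoE) bound: {sparse_bound:.2f} tokens/s")
print(f"dense bound:        {dense_bound:.3f} tokens/s")
```

Under these assumed inputs, reading only the active experts raises the ceiling by the ratio of total to active parameters. Real throughput also depends on compute, caching, and how the layers are split between GPU and host memory.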
MoE architectures combined with tiered memory are making 1T+ parameter models viable for hobbyists, effectively bypassing the "VRAM tax" for large-scale reasoning.
- Intel's discontinued Optane PMem modules sit between DRAM and SSDs in latency, bandwidth, and cost per gigabyte, which suits offloading sparsely accessed expert weights.
- This build suggests that memory capacity, rather than raw FLOPs, is the primary hurdle for local frontier LLM deployment.
- Quantization schemes like Unsloth's dynamic quants are essential for fitting 1T models into sub-1TB memory footprints.
- The 5 t/s result shows that expensive H100 clusters aren't the only way to reach acceptable inference speeds for research.
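The capacity point can be checked with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by eight. The sketch below uses the 1.04 trillion parameter count and 768GB capacity quoted above. The bit widths are generic examples rather than the actual quantization used, and the estimate ignores the KV cache, runtime overhead, and any DRAM or VRAM in the system.

```python
TOTAL_PARAMS = 1.04e12   # parameter count quoted in the summary
CAPACITY_GB = 768        # Optane PMem capacity quoted in the summary


def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# Example average bit widths (illustrative, not the build's settings).
for bits in (16, 8, 4, 2):
    size = footprint_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= CAPACITY_GB else "does not fit"
    print(f"{bits:>2} bits/weight: {size:7.0f} GB -> {verdict} in {CAPACITY_GB} GB")

# Largest average bit width whose weights alone fit in the capacity.
max_bits = CAPACITY_GB * 1e9 * 8 / TOTAL_PARAMS
print(f"max average bits/weight: {max_bits:.2f}")
```

By this estimate, 16-bit and 8-bit weights exceed 768GB, so the model only fits once the average width drops below roughly 6 bits per weight.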
DISCOVERED
2026-04-15
PUBLISHED
2026-04-15
AUTHOR
APFrisco