Optane PMem build runs 1 trillion parameter LLM locally
OPEN_SOURCE
REDDIT // 3h ago // INFRASTRUCTURE


A specialized local build featuring 768GB of secondhand Intel Optane Persistent Memory and an RTX 3060 has successfully run the 1.04 trillion parameter Kimi K2.5 model at roughly 5 tokens per second. By leveraging the sparse Mixture-of-Experts architecture and llama.cpp's hybrid offloading, the project achieves frontier-class inference on a hardware budget far below traditional GPU-heavy alternatives.
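The hybrid offload described above can be sketched with llama.cpp's tensor-routing flags. `-ngl` and `-ot`/`--override-tensor` are real llama.cpp options commonly used for MoE expert offload; the model filename and quant variant below are assumptions, not details from the post:

```shell
# Hypothetical invocation, assuming PMem is provisioned as system memory:
# -ngl 99 : keep the dense/attention layers on the RTX 3060
# -ot ... : route the large MoE expert tensors to CPU-side memory (DRAM/PMem)
./llama-server \
  -m kimi-k2.5-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192
```

The regex matches the `ffn_*_exps` expert weight tensors, which hold the bulk of a sparse MoE model's parameters, so only the small dense portion needs to fit in the 12GB of VRAM.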

// ANALYSIS

MoE architectures combined with tiered memory are making 1T+ parameter models viable for hobbyists, effectively bypassing the "VRAM tax" for large-scale reasoning.

  • Intel's discontinued PMem modules provide a high-bandwidth, low-latency middle ground between DRAM and SSDs, ideal for sparse expert offloading.
  • This build demonstrates that memory capacity, not just FLOPs, is the primary hurdle for local frontier LLM deployment.
  • Software optimizations like Unsloth's dynamic quants are essential for fitting 1T models into sub-1TB memory footprints.
  • The roughly 5 t/s result shows that expensive H100 clusters aren't the only route to acceptable inference speeds for research use.
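The capacity and bandwidth claims in the bullets above can be checked with back-of-envelope arithmetic. This is a rough sketch under stated assumptions (roughly 32B active parameters per token for a Kimi K2-class MoE, an average of ~4.5 bits/param for a dynamic quant, and ~30 GB/s aggregate PMem read bandwidth), not figures from the post:

```python
def quantized_size_gb(params: float, bits_per_param: float) -> float:
    """Model footprint in GB at a given average quantization width."""
    return params * bits_per_param / 8 / 1e9

def decode_tokens_per_s(active_params: float, bits_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Decode is bandwidth-bound: each token reads every active weight once."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~585 GB at 4.5 bits/param -> fits in 768 GB of PMem (assumed quant width)
total = quantized_size_gb(1.04e12, 4.5)
# PMem-only bound lands at a couple of tokens/s; caching hot experts in
# DRAM and keeping dense layers in VRAM lifts the effective rate
speed = decode_tokens_per_s(32e9, 4.5, 30)

print(f"footprint ~{total:.0f} GB, PMem-bound decode ~{speed:.1f} tok/s")
```

The point of the sketch is that decode speed is set by how many bytes of active weights must move per token, which is why a sparse MoE (32B active of 1.04T total) is tractable on slow tiers where a dense 1T model would not be.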
// TAGS
llm, inference, gpu, self-hosted, intel-optane, kimi-k2.5, unsloth, moe

DISCOVERED

3h ago · 2026-04-15

PUBLISHED

3h ago · 2026-04-15

RELEVANCE

8/10

AUTHOR

APFrisco