OPEN_SOURCE
REDDIT // 21d ago · OPEN-SOURCE RELEASE
Hypura runs bigger models on Macs
Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that spreads tensors across GPU, RAM, and NVMe so models larger than local memory can still run. The open-source project is built on llama.cpp.
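The core idea of storage-tier-aware placement can be sketched as a greedy planner that assigns each tensor to the fastest tier with room left, falling back from GPU to RAM to NVMe. This is a minimal illustration under assumed names and capacities, not Hypura's actual API:

```python
# Hypothetical sketch of tier-aware placement: greedily put each tensor
# on the fastest tier that still has capacity (GPU -> RAM -> NVMe).
# All names, sizes, and the function signature are illustrative.

def plan_placement(tensors, capacities):
    """tensors: list of (name, size); capacities: dict tier -> size."""
    remaining = dict(capacities)
    plan = {}
    # Place the largest tensors first so they get the fastest tiers.
    for name, size in sorted(tensors, key=lambda t: -t[1]):
        for tier in ("gpu", "ram", "nvme"):
            if remaining[tier] >= size:
                plan[name] = tier
                remaining[tier] -= size
                break
        else:
            raise MemoryError(f"{name} does not fit in any tier")
    return plan

plan = plan_placement(
    [("attn.0", 4), ("ffn.0", 10), ("expert.3", 12)],
    {"gpu": 12, "ram": 10, "nvme": 100},
)
# -> {"expert.3": "gpu", "ffn.0": "ram", "attn.0": "nvme"}
```

A real planner would also weigh access frequency (hot attention weights vs. rarely used experts), which is what makes the MoE case below work so well.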
// ANALYSIS
This is smart systems work, not a miracle accelerator. Hypura treats NVMe as a legitimate third memory tier, which is exactly the sort of hack that makes local LLMs feel less boxed in on Apple Silicon.
- MoE models are the sweet spot: only a subset of experts fires per token, so dormant experts can live on NVMe and be fetched on demand.
- Dense models still pay the latency tax once FFN weights start streaming, so the win is feasibility more than raw throughput.
- The benchmark claims are deliberately practical: 2.2 tok/s on a 31 GB Mixtral and 0.3 tok/s on a 70B model are not fast, but they turn a hard OOM into something usable.
- The automatic placement planner, prefetch logic, and Ollama-compatible server hide the ugly parts of buffer sizing and tier assignment from users.
- For models that already fit in memory, the project claims zero overhead, which is exactly the graceful fallback you want from a scheduler.
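The on-demand expert fetching described above amounts to caching hot experts in RAM while cold ones stay on NVMe. A minimal sketch, assuming a hypothetical `load_fn` that reads weights from disk and a made-up capacity (not Hypura's real interfaces):

```python
from collections import OrderedDict

# Illustrative LRU cache for MoE experts: the router asks for an expert,
# and if it is not resident we load it from NVMe, evicting the
# least-recently-used expert when the RAM budget is exceeded.

class ExpertCache:
    def __init__(self, capacity, load_fn):
        self.capacity = capacity   # max experts resident in RAM (assumed)
        self.load_fn = load_fn     # hypothetical NVMe read
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # NVMe read happens here
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least-recently-used
        return weights

loads = []
cache = ExpertCache(2, lambda i: loads.append(i) or f"w{i}")
for expert in (0, 1, 0, 2, 1):
    cache.get(expert)
# loads == [0, 1, 2, 1]: expert 1 was evicted and had to be reloaded
```

This is also why dense models fare worse: their FFN weights are needed every token, so the "cache" thrashes and every miss is an NVMe read on the critical path.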
// TAGS
hypura · llm · inference · gpu · open-source · self-hosted · cli · api
DISCOVERED
2026-03-22 (21d ago)
PUBLISHED
2026-03-22 (21d ago)
RELEVANCE
8/10
AUTHOR
tbaumer22