Hypura runs bigger models on Macs
OPEN_SOURCE · REDDIT · 21d ago · OPEN-SOURCE RELEASE


Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that spreads tensors across GPU, RAM, and NVMe so models larger than local memory can still run. The open-source project is built on llama.cpp.
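Hypura's actual planner lives in the project itself; as a rough illustration of what storage-tier-aware placement means, a greedy pass might assign hotter tensors to faster tiers and spill the rest to NVMe. Everything here (the `hot` heuristic, names, policy) is an assumption for the sketch, not Hypura's API:

```python
# Hypothetical sketch of storage-tier-aware placement: assign tensors
# (hottest first) to the fastest tier with room, spilling to NVMe.
# Illustrative only; not Hypura's actual planner.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    nbytes: int
    hot: float  # assumed metric: expected accesses per token

def plan_placement(tensors, gpu_bytes, ram_bytes):
    """Return {tensor name: tier}; hotter tensors get faster tiers."""
    plan, free = {}, {"gpu": gpu_bytes, "ram": ram_bytes}
    for t in sorted(tensors, key=lambda t: t.hot, reverse=True):
        for tier in ("gpu", "ram"):
            if free[tier] >= t.nbytes:
                free[tier] -= t.nbytes
                plan[t.name] = tier
                break
        else:
            plan[t.name] = "nvme"  # unbounded spill tier
    return plan

tensors = [
    Tensor("attn.0", 2 << 30, hot=1.0),     # fires every token
    Tensor("expert.0", 4 << 30, hot=0.25),  # MoE expert, fires sometimes
    Tensor("expert.1", 4 << 30, hot=0.25),
]
print(plan_placement(tensors, gpu_bytes=3 << 30, ram_bytes=4 << 30))
# → {'attn.0': 'gpu', 'expert.0': 'ram', 'expert.1': 'nvme'}
```

The key design point is that "larger than memory" stops being a hard error: anything that doesn't fit simply lands in the NVMe tier and streams at access time.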

// ANALYSIS

This is smart systems work, not a miracle accelerator. Hypura treats NVMe as a legitimate third memory tier, which is exactly the sort of hack that makes local LLMs feel less boxed in on Apple Silicon.

  • MoE models are the sweet spot because only a subset of experts fire per token, so dormant experts can live on NVMe and be fetched on demand.
  • Dense models still pay the latency tax once FFN weights start streaming, so the win is feasibility more than raw throughput.
  • The benchmark claims are deliberately practical: 2.2 tok/s on a 31 GB Mixtral and 0.3 tok/s on a 70B model are not fast, but they turn a hard OOM into something usable.
  • The automatic placement planner, prefetch logic, and Ollama-compatible server hide the ugly parts of buffer sizing and tier assignment from users.
  • For models that already fit in memory, the project claims zero overhead, which is exactly the kind of graceful fallback you want from a scheduler.
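The MoE point above can be sketched: if only the routed experts fire per token, a small resident cache absorbs repeat accesses and only cache misses pay the NVMe latency. This toy LRU model is illustrative, not Hypura's implementation:

```python
# Toy model (not Hypura's code) of on-demand MoE expert streaming:
# an LRU cache keeps the most recently used experts resident; a miss
# simulates an NVMe fetch. Shows why MoE is the sweet spot: only the
# routed experts per token touch storage.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity, loader):
        self.capacity = capacity   # experts that fit in RAM
        self.loader = loader       # e.g. reads weights from NVMe
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark recently used
            return self.cache[expert_id]
        self.misses += 1
        weights = self.loader(expert_id)       # simulated NVMe read
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recent
        return weights

cache = ExpertCache(capacity=2, loader=lambda i: f"weights[{i}]")
for routed in [0, 1, 0, 2, 0, 1]:  # per-token routing decisions
    cache.get(routed)
print(cache.misses)  # → 4
```

With skewed routing, most lookups hit the resident set, which is why prefetch plus caching can keep a 31 GB MoE model moving while a dense model of the same size would stall on every layer's streamed FFN weights.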
// TAGS
hypura · llm · inference · gpu · open-source · self-hosted · cli · api

DISCOVERED

2026-03-22 (21d ago)

PUBLISHED

2026-03-22 (21d ago)

RELEVANCE

8 / 10

AUTHOR

tbaumer22