Hypura runs bigger models on Macs

// 67d ago · OPENSOURCE RELEASE

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that spreads tensors across GPU, RAM, and NVMe so models larger than local memory can still run. The open-source project is built on llama.cpp.
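
The idea of spreading tensors across GPU, RAM, and NVMe can be sketched as a greedy planner that puts the most frequently touched tensors in the fastest tier with room left. This is an illustrative sketch only, not Hypura's actual placement algorithm: the `Tensor` class, the `accesses_per_token` hotness estimate, the tensor names, and the capacities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_token: float  # hypothetical hotness estimate, not a Hypura field


def plan_placement(tensors, capacities_gb):
    """Greedy sketch: hottest tensors go to the fastest tier that still has room."""
    tiers = ["gpu", "ram", "nvme"]  # ordered fastest to slowest
    free = dict(capacities_gb)      # copy so the caller's dict is not mutated
    placement = {}
    for t in sorted(tensors, key=lambda t: t.accesses_per_token, reverse=True):
        for tier in tiers:
            if free[tier] >= t.size_gb:
                free[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            # No tier, not even NVMe, can hold this tensor.
            raise MemoryError(f"no tier can hold {t.name}")
    return placement


# Hypothetical model: two always-used tensors plus two rarely-used MoE experts.
tensors = [
    Tensor("attention", 4.0, 1.0),
    Tensor("shared_ffn", 6.0, 1.0),
    Tensor("expert_0", 8.0, 0.25),
    Tensor("expert_1", 8.0, 0.25),
]
placement = plan_placement(tensors, {"gpu": 8.0, "ram": 8.0, "nvme": 100.0})
print(placement)
```

With these made-up sizes, the always-used tensors land in GPU and RAM, and both experts spill to NVMe instead of triggering an out-of-memory failure.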

// ANALYSIS

This is smart systems work, not a miracle accelerator. Hypura treats NVMe as a legitimate third memory tier, which is exactly the sort of hack that makes local LLMs feel less boxed in on Apple Silicon.

  • MoE models are the sweet spot because only a subset of experts fire per token, so dormant experts can live on NVMe and be fetched on demand.
  • Dense models still pay the latency tax once FFN weights start streaming, so the win is feasibility more than raw throughput.
  • The benchmark claims are deliberately practical: 2.2 tok/s on a 31 GB Mixtral and 0.3 tok/s on a 70B model are not fast, but they turn a hard OOM into something usable.
  • The automatic placement planner, prefetch logic, and Ollama-compatible server hide the ugly parts of buffer sizing and tier assignment from users.
  • For models that already fit in memory, the project claims zero overhead, which is exactly the kind of graceful fallback you want from a scheduler.
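
The MoE point above (dormant experts live on NVMe and are fetched on demand) can be illustrated with a small least-recently-used cache. This is a minimal sketch under assumptions: the source does not describe Hypura's eviction or prefetch policy, so LRU is a stand-in, and `ExpertCache`, `load_fn`, and the routing sequence are hypothetical names and data.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps at most `capacity` experts resident in fast memory (LRU eviction)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader, e.g. a read from NVMe
        self._resident = OrderedDict()  # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)          # slow path: fetch on demand
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)     # evict least recently used
        return weights

    def resident_ids(self):
        return list(self._resident)


# Stand-in loader; a real one would read expert weights from disk.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:  # hypothetical per-token routing decisions
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.resident_ids())
```

Each miss stands for an NVMe read, which is why routing patterns that reuse a small set of experts benefit most, and why dense models, which touch every weight on every token, see no such relief.
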
// TAGS
hypura, llm, inference, gpu, open-source, self-hosted, cli, api

DISCOVERED

67d ago

2026-03-22

PUBLISHED

67d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

tbaumer22