SoloHeaven cuts Apple Silicon LLM wait 200x
REDDIT · 28d ago · OPEN SOURCE RELEASE

MLX SoloHeaven is an open-source (MIT) local inference server for Apple Silicon that reuses KV cache across conversation turns, reducing time-to-first-token from 126s to 0.5s at 100K+ context — a 200x improvement. Built on Apple's MLX framework, it ships with an OpenAI-compatible API, disk-persisted cache, web UI, and support for Qwen3.5 hybrid attention models.
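Because the server exposes an OpenAI-compatible API, existing clients can target it unchanged. A minimal sketch of the request shape — the model name `qwen3.5`, the helper `build_chat_request`, and the resend-the-full-history convention are illustrative assumptions, not details from the project docs:

```python
# Hypothetical sketch of an OpenAI-style chat request for a local server.
# Cache reuse depends on the conversation prefix staying identical across
# turns, so prior messages are resent unchanged and only the new user
# turn is appended.

def build_chat_request(messages, model="qwen3.5", stream=True):
    """Build an OpenAI-style /v1/chat/completions payload (assumed shape)."""
    return {"model": model, "messages": messages, "stream": stream}

history = [{"role": "user", "content": "Summarize this repo."}]
payload = build_chat_request(history)
print(payload["model"])  # → qwen3.5
```

In practice the only client-side change would be pointing the base URL at the local server; the payload itself is standard.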

// ANALYSIS

For anyone running large models locally on Apple Silicon, this is the most practical long-context inference optimization to ship in a while — and the benchmark data is unusually rigorous.

  • The core insight is obvious in retrospect: stop re-processing the entire conversation every turn. At 100K context on an M3 Ultra, that's the difference between 2 minutes and half a second per response.
  • Thinking token preservation is the non-obvious finding: trimming `<think>` tokens from the cache caused 31% longer outputs and quality regression, because Qwen3.5 references past reasoning across turns via hybrid DeltaNet attention layers.
  • KV 8-bit quantization was benchmarked and rejected: it cost a 16.5% TPS drop for minimal memory savings — a useful data point for the ongoing community debate over quantized KV caches.
  • Disk persistence means the cache survives server restarts, which changes the economics of long coding or agent sessions — the "cold start" penalty only hits once.
  • Real-world numbers from a 191-message coding session show 89% cache hit rate and 11.8M tokens saved — this is not a toy benchmark.
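The cache-reuse idea in the first bullet can be sketched as a longest-common-prefix check over token ids: only tokens past the shared prefix need a forward pass. This is a simplified illustration, not the project's actual implementation (the function name `reusable_prefix_len` is hypothetical):

```python
def reusable_prefix_len(cached_ids, new_ids):
    """Length of the shared token prefix between the cached sequence and
    the new prompt; only tokens past this point need to be processed."""
    n = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]         # tokens already materialized in the KV cache
prompt = [1, 2, 3, 9, 10, 11]    # new turn appended to the same conversation
hit = reusable_prefix_len(cached, prompt)
print(hit)                        # → 3
print(len(prompt) - hit)          # → 3 tokens actually need a forward pass
```

At 100K+ context, nearly the entire prompt falls inside the shared prefix on each turn, which is why time-to-first-token collapses from minutes to sub-second.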
// TAGS
mlx-soloheaven · llm-inference · edge-ai · open-source · self-hosted

DISCOVERED

28d ago

2026-03-15

PUBLISHED

28d ago

2026-03-15

RELEVANCE

8 / 10

AUTHOR

Present-Mirror-6706