OPEN_SOURCE
REDDIT · 28d ago · OPEN-SOURCE RELEASE
SoloHeaven cuts Apple Silicon LLM wait 200x
MLX SoloHeaven is an open-source (MIT-licensed) local inference server for Apple Silicon that reuses the KV cache across conversation turns, cutting time-to-first-token from 126s to 0.5s at 100K+ context, a roughly 200x improvement. Built on Apple's MLX framework, it ships with an OpenAI-compatible API, a disk-persisted cache, a web UI, and support for Qwen3.5 hybrid-attention models.
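Because the server exposes an OpenAI-compatible API, any standard OpenAI client should be able to point at it. A minimal stdlib sketch of building such a chat-completion request; the port, path, and model name here are assumptions for illustration, not taken from the release:

```python
import json
import urllib.request

def chat_request(messages, base_url="http://localhost:8080/v1",
                 model="qwen3.5"):
    """Build a POST request for an OpenAI-style /chat/completions endpoint.

    base_url and model are placeholders; check the server's docs for the
    actual defaults.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it: json.load(urllib.request.urlopen(req))
```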
// ANALYSIS
For anyone running large models locally on Apple Silicon, this is the most practical long-context inference optimization to ship in a while — and the benchmark data is unusually rigorous.
- The core insight is obvious in retrospect: stop re-processing the entire conversation every turn. At 100K context on an M3 Ultra, that's the difference between 2 minutes and half a second per response.
- Thinking-token preservation is the non-obvious finding: trimming `<think>` tokens from the cache caused 31% longer outputs and a quality regression, because Qwen3.5 references past reasoning across turns via hybrid DeltaNet attention layers.
- KV 8-bit quantization was benchmarked and rejected — a 16.5% TPS drop with minimal memory savings, a useful data point for the community debate over quantized KV caches.
- Disk persistence means the cache survives server restarts, which changes the economics of long coding or agent sessions — the "cold start" penalty only hits once.
- Real-world numbers from a 191-message coding session show an 89% cache hit rate and 11.8M tokens saved — this is not a toy benchmark.
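The cross-turn reuse in the first bullet and the disk persistence in the fourth can be sketched together as a longest-prefix cache. Everything below (class name, file name, the opaque "kv state" value) is illustrative, not SoloHeaven's actual code; a real server would store per-layer MLX KV tensors and trim at the first diverging token rather than requiring an exact cached prefix:

```python
import pickle
from pathlib import Path

class PrefixKVCache:
    """Toy longest-prefix KV cache with pickle-based disk persistence."""

    def __init__(self, cache_file="kv_cache.pkl"):
        self.path = Path(cache_file)
        self.entries = []  # list of (token_tuple, kv_state) pairs
        if self.path.exists():
            # "cold start" penalty paid once: later restarts reload from disk
            self.entries = pickle.loads(self.path.read_bytes())

    def longest_prefix(self, tokens):
        """Return (reused_len, kv_state) for the longest cached entry that
        is a prefix of `tokens`; only the remaining suffix needs prefill."""
        best_len, best_state = 0, None
        for cached, state in self.entries:
            if len(cached) > best_len and tuple(tokens[:len(cached)]) == cached:
                best_len, best_state = len(cached), state
        return best_len, best_state

    def store(self, tokens, kv_state):
        self.entries.append((tuple(tokens), kv_state))
        self.path.write_bytes(pickle.dumps(self.entries))  # survives restarts
```

On turn N+1 the conversation is a strict superset of turn N's tokens, so the whole previous turn is reusable and only the new user message is prefilled — the mechanism behind the reported 126s-to-0.5s drop.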
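For concreteness, the rejected cache-trimming in the second bullet amounts to something like the following (a hypothetical sketch of the strategy that was benchmarked, not code from the project — SoloHeaven deliberately keeps these spans in the cache):

```python
import re

# Dropping <think>...</think> spans from cached assistant turns looks like
# a free context saving, but the benchmark found it caused 31% longer
# outputs and a quality regression, because the model re-reads its own
# past reasoning across turns.
THINK_SPAN = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(turn_text: str) -> str:
    """Remove reasoning spans from a cached assistant turn."""
    return THINK_SPAN.sub("", turn_text)
```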
// TAGS
mlx-soloheaven · llm · inference · edge-ai · open-source · self-hosted
DISCOVERED
2026-03-15
PUBLISHED
2026-03-15
RELEVANCE
8/10
AUTHOR
Present-Mirror-6706