PersonaPlex 7B leaks memory on Apple Silicon
A Reddit user reports that PersonaPlex 7B can run in real time on an M5 Max in MLX, but the full-duplex Apple Silicon path quickly exhausts unified memory and crashes inside MLX’s arena allocator when the KV cache grows via repeated concatenation. A preallocated cache avoids the leak but makes inference too slow for the real-time target, and the official NVIDIA server does not appear to offer a practical MPS route. The post frames the core question as whether this is a solvable MLX implementation issue or a sign that the model really needs CUDA-class hardware for stable full-duplex use.
The reported failure looks more like an MLX runtime issue than a model-quality problem: growing the KV cache by repeated concatenation is the likely driver of the memory growth. The tradeoff is stark. The fast concatenation path appears to hit a memory cliff, while the safer preallocated cache is too slow for the reported real-time target. The post also suggests the Apple Silicon audio stack is still immature, since full-duplex output quality is described as poor even before the crash. If MLX’s execution model is the root cause, periodic flushes probably will not fix it; the cache layout or kernel path likely needs to change.
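To make the cache tradeoff concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not MLX code and does not reproduce the reported crash; it only counts how many bytes each growth strategy requests over a run. The class names, `BYTES_PER_TOKEN`, and the 256-token chunk size are hypothetical values chosen for illustration, and the chunked cache is one possible middle ground between the two paths described in the post, not something the post reports testing.

```python
from dataclasses import dataclass, field

# Hypothetical per-token key/value footprint, chosen only for illustration.
BYTES_PER_TOKEN = 64


@dataclass
class ConcatCache:
    """Grows by building a brand-new buffer on every step."""
    buf: bytes = b""
    bytes_requested: int = 0

    def append(self, token_kv: bytes) -> None:
        # Concatenation copies the whole history into a fresh buffer;
        # the previous buffer becomes garbage for the allocator to reclaim.
        self.buf = self.buf + token_kv
        self.bytes_requested += len(self.buf)


@dataclass
class ChunkedCache:
    """Grows in fixed-size chunks and writes new tokens in place."""
    chunk_tokens: int = 256  # hypothetical chunk size
    buf: bytearray = field(default_factory=bytearray)
    used: int = 0
    bytes_requested: int = 0

    def append(self, token_kv: bytes) -> None:
        if self.used + len(token_kv) > len(self.buf):
            grow = self.chunk_tokens * BYTES_PER_TOKEN
            self.buf.extend(bytes(grow))
            # Counts only the growth we ask for, not allocator internals.
            self.bytes_requested += grow
        self.buf[self.used:self.used + len(token_kv)] = token_kv
        self.used += len(token_kv)


def simulate(steps: int) -> tuple[int, int, int]:
    """Append `steps` tokens to both caches and return the byte counters."""
    concat, chunked = ConcatCache(), ChunkedCache()
    token = bytes(BYTES_PER_TOKEN)
    for _ in range(steps):
        concat.append(token)
        chunked.append(token)
    return concat.bytes_requested, chunked.bytes_requested, chunked.used


if __name__ == "__main__":
    concat_bytes, chunked_bytes, used = simulate(1000)
    print(f"concat requested:  {concat_bytes:,} bytes")
    print(f"chunked requested: {chunked_bytes:,} bytes for {used:,} bytes used")
```

The concatenation path requests memory proportional to the square of the sequence length, while the chunked path stays linear. Whether the discarded buffers are actually returned promptly depends on the runtime and its allocator, which is the open question the post raises about MLX.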
DISCOVERED
2026-04-07
PUBLISHED
2026-04-07
AUTHOR
Excellent_Koala769