PersonaPlex 7B leaks memory on Apple Silicon
REDDIT · 4d ago · MODEL RELEASE

A Reddit user reports that PersonaPlex 7B can run in real time on an M5 Max in MLX, but the full-duplex Apple Silicon path quickly exhausts unified memory and crashes inside MLX’s arena allocator when the KV cache grows via repeated concatenation. A preallocated cache avoids the leak but makes inference too slow for the real-time target, and the official NVIDIA server does not appear to offer a practical MPS route. The post frames the core question as whether this is a solvable MLX implementation issue or a sign that the model really needs CUDA-class hardware for stable full-duplex use.
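To make the concatenation concern concrete, the arithmetic below compares the final size of a KV cache with the total bytes allocated when every step builds a fresh, one-entry-longer buffer. It is a rough sketch only: the layer count, head count, head dimension, element size, and step count are hypothetical placeholders, not PersonaPlex's actual configuration, and cumulative allocation equals peak memory only if earlier buffers are never reclaimed.

```python
def kv_bytes_per_step(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes added to the cache per decoded step (one key and one value per layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def concat_growth_bytes(steps: int, bytes_per_step: int) -> tuple[int, int]:
    """Return (final_cache_bytes, cumulative_bytes_allocated) for concat-based growth.

    Step t allocates a new buffer holding t entries, so the cumulative total is
    bytes_per_step * (1 + 2 + ... + steps).
    """
    final = steps * bytes_per_step
    cumulative = bytes_per_step * steps * (steps + 1) // 2
    return final, cumulative


# Hypothetical model shape and session length, chosen only for illustration.
per_step = kv_bytes_per_step(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
final, cumulative = concat_growth_bytes(steps=1000, bytes_per_step=per_step)

print(f"per step:   {per_step / 2**20:.2f} MiB")
print(f"final:      {final / 2**30:.2f} GiB")
print(f"cumulative: {cumulative / 2**30:.2f} GiB allocated over the run")
```

The final cache grows linearly with the number of steps, while the cumulative allocation grows quadratically. If the allocator releases old buffers promptly, the quadratic term is only churn; if it does not, that term is what would exhaust unified memory.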

// ANALYSIS

The reported failure looks more like an MLX runtime issue than a model-quality problem, with repeated concat in the KV cache likely driving the memory growth. The tradeoff is stark: the fast concat path appears to hit the memory cliff, while the safer preallocated cache is too slow for the reported real-time target. The post also suggests the Apple Silicon audio stack is still immature, since full-duplex output quality is described as poor even before the crash. If MLX’s execution model is the root cause, periodic flushes probably will not fix it; the cache layout or kernel path likely needs to change.
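One cache layout that avoids per-step allocation is a fixed-capacity buffer written in place. The sketch below is a toy in pure Python, not MLX code and not the poster's implementation: `FixedKVCache`, its parameters, and the overwrite-oldest policy are all assumptions for illustration, and whether the model tolerates a bounded context window is not established by the post. It also says nothing about speed, which is where the post reports the preallocated path falling short.

```python
from array import array


class FixedKVCache:
    """Toy fixed-capacity cache: one flat float buffer, allocated once, written in place."""

    def __init__(self, capacity: int, width: int) -> None:
        self.capacity = capacity  # maximum number of rows kept
        self.width = width        # floats per row (stand-in for one step's K/V data)
        self.buf = array("f", [0.0]) * (capacity * width)  # single up-front allocation
        self.count = 0            # total rows ever appended

    def append(self, row: list[float]) -> None:
        """Write one row in place, overwriting the oldest row once the cache is full."""
        if len(row) != self.width:
            raise ValueError(f"expected {self.width} values, got {len(row)}")
        start = (self.count % self.capacity) * self.width
        self.buf[start:start + self.width] = array("f", row)
        self.count += 1

    def __len__(self) -> int:
        return min(self.count, self.capacity)

    def rows(self) -> list[list[float]]:
        """Return the cached rows, oldest first."""
        out = []
        for i in range(self.count - len(self), self.count):
            start = (i % self.capacity) * self.width
            out.append(list(self.buf[start:start + self.width]))
        return out


cache = FixedKVCache(capacity=3, width=2)
for step in range(1, 5):
    cache.append([float(step), float(step)])

print(len(cache.buf))  # buffer length stays at capacity * width
print(cache.rows())    # the three most recent rows, oldest first
```

The buffer never changes size, so memory use is bounded by construction. The cost is that context older than `capacity` steps is discarded, a tradeoff a real implementation would have to validate against output quality.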

// TAGS
personaplex · nvidia · moshi · mlx · apple-silicon · speech-to-speech · full-duplex · unified-memory · real-time-voice

DISCOVERED

2026-04-07

PUBLISHED

2026-04-07

RELEVANCE

9/10

AUTHOR

Excellent_Koala769