OPEN_SOURCE
REDDIT // BENCHMARK RESULT
MiniMax M2.5 tops 62 tok/s on M5 Max
A Reddit user reports running MiniMax M2.5 locally on an Apple M5 Max with 128GB unified memory, using a Q3 quantized GGUF and llama.cpp's built-in `llama-server`. They claim about 62 tokens per second at 16k context and say the setup is OpenAI-compatible, with the box also exposed as a public API.
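The "OpenAI-compatible" claim means the local `llama-server` instance can be queried like any OpenAI-style endpoint. A minimal sketch of what that looks like, assuming llama-server's default port 8080 and its `/v1/chat/completions` route (the model name field is advisory; the server answers with whatever GGUF it loaded):

```python
import json
import urllib.request

# Assumed local endpoint; adjust host/port to match the actual deployment.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "minimax-m2.5",  # advisory for llama-server, which serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

def query(prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `query("...")` against a running server would return the completion; the same payload shape works with any OpenAI-compatible client library pointed at the local base URL.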
// ANALYSIS
This reads less like hype and more like a useful proof point: a 230B-class open-weights model is starting to feel practical on serious consumer hardware if you’re willing to quantize aggressively and accept some tradeoffs.
- 62 tok/s on a laptop-class machine is genuinely impressive, but the `UD-Q3_K_XL` quantization is doing a lot of the heavy lifting here.
- The OpenAI-compatible server shape matters as much as the raw speed, because it makes the model much easier to drop into agent and app workflows.
- MiniMax’s own docs position M2.5 for local/private deployment, so this post is a real-world validation of that story rather than a one-off stunt.
- The public API angle turns a personal workstation into a shareable inference endpoint, but production users would still want to validate latency consistency, memory headroom, and access controls.
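The "128GB fits a 230B model" claim is easy to sanity-check with back-of-envelope arithmetic. The ~3.5 bits/weight figure below is an assumption for a Q3_K-style quant, not a measured value; real GGUF files mix tensor types and add KV-cache and runtime overhead on top:

```python
# Rough memory estimate for an aggressively quantized 230B-parameter model.
# bits_per_weight ~3.5 is an assumed average for a Q3_K-class quant.

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in gigabytes (decimal GB)."""
    return n_params * bits_per_weight / 8 / 1e9

weights_gb = quantized_size_gb(230e9, 3.5)
print(f"~{weights_gb:.0f} GB of weights")  # on the order of 100 GB
```

That lands around 100 GB of weights, which explains both why a 128GB unified-memory machine can hold the model and why there is little headroom left for long contexts or other workloads.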
// TAGS
minimax-m2-5 · llm · open-weights · self-hosted · inference · edge-ai · api
DISCOVERED
22d ago
2026-03-21
PUBLISHED
22d ago
2026-03-21
RELEVANCE
9/10
AUTHOR
Equivalent-Buy1706