OPEN_SOURCE
REDDIT // BENCHMARK RESULT
MiniMax M2.5 tops 62 tok/s on M5 Max
A Reddit user reports running MiniMax M2.5 locally on an Apple M5 Max with 128GB unified memory, using a Q3 quantized GGUF and llama.cpp's built-in `llama-server`. They claim about 62 tokens per second at 16k context and say the setup is OpenAI-compatible, with the box also exposed as a public API.
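The "OpenAI-compatible" claim means the local `llama-server` instance can be queried like any OpenAI-style endpoint. A minimal sketch of what that looks like, assuming llama-server's default port 8080 and its `/v1/chat/completions` route (the model name field is advisory; the server answers with whatever GGUF it loaded):

```python
import json
import urllib.request

# Assumed local endpoint; adjust host/port to match the actual deployment.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "minimax-m2.5",  # advisory for llama-server, which serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

def query(prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `query("...")` against a running server would return the completion; the same payload shape works with any OpenAI-compatible client library pointed at the local base URL.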
// ANALYSIS
This reads less like hype and more like a useful proof point: a 230B-class open-weights model is starting to feel practical on serious consumer hardware if you’re willing to quantize aggressively and accept some tradeoffs.
- 62 tok/s on a laptop-class machine is genuinely impressive, but the `UD-Q3_K_XL` quantization is doing a lot of the heavy lifting here.
- The OpenAI-compatible server shape matters as much as the raw speed, because it makes the model much easier to drop into agent and app workflows.
- MiniMax’s own docs position M2.5 for local/private deployment, so this post is a real-world validation of that story rather than a one-off stunt.
- The public API angle turns a personal workstation into a shareable inference endpoint, but production users would still want to validate latency consistency, memory headroom, and access controls.
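The "128GB fits a 230B model" claim is easy to sanity-check with back-of-envelope arithmetic. The ~3.5 bits/weight figure below is an assumption for a Q3_K-style quant, not a measured value; real GGUF files mix tensor types and add KV-cache and runtime overhead on top:

```python
# Rough memory estimate for an aggressively quantized 230B-parameter model.
# bits_per_weight ~3.5 is an assumed average for a Q3_K-class quant.

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in gigabytes (decimal GB)."""
    return n_params * bits_per_weight / 8 / 1e9

weights_gb = quantized_size_gb(230e9, 3.5)
print(f"~{weights_gb:.0f} GB of weights")  # on the order of 100 GB
```

That lands around 100 GB of weights, which explains both why a 128GB unified-memory machine can hold the model and why there is little headroom left for long contexts or other workloads.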
// TAGS
minimax-m2-5 · llm · open-weights · self-hosted · inference · edge-ai · api
DISCOVERED
22d ago
2026-03-21
PUBLISHED
22d ago
2026-03-21
RELEVANCE
9/10
AUTHOR
Equivalent-Buy1706