MiniMax M2.7 hits 100k on Strix Halo
This post shares a hard-won local inference setup for pushing MiniMax M2.7 to 100k context on Strix Halo with `llama-server`, along with the exact flags that made it stable: no context shifting, no mmap, a unified KV cache, a VRAM-only cache, and larger batch sizes for prefill. It also covers deployment on headless Fedora, including swap and OOM tuning, and closes with a candid read on the model: strong coding intuition and intent-following, but weaker architecture and code-review judgment than Qwen3.6 27B.
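For concreteness, here is what those flags look like assembled into one command. This is a sketch, not the author’s exact invocation: the model path, the context size (`-c 102400` for ~100k), `-ngl 999` for full GPU offload, and the host/port bind are assumptions for illustration.

```bash
# Sketch of the described llama-server setup; paths and sizes are placeholders.
llama-server \
  -m ./minimax-m2.7.gguf \
  -c 102400 \
  -ngl 999 \
  --no-mmap \
  --no-context-shift \
  --kv-unified \
  --cache-ram 0 \
  -b 1024 -ub 1024 \
  --host 0.0.0.0 --port 8080
# --no-context-shift: reject overflow instead of silently evicting old context.
# --no-mmap:          load weights into memory up front instead of mmap-ing.
# --kv-unified:       one unified KV cache rather than per-slot caches.
# --cache-ram 0:      keep the prompt cache in VRAM only, no spill to RAM.
# -b/-ub 1024:        larger logical/physical batches to speed up prefill.
# Optional and workload-dependent per the author: --cache-reuse 256.
```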
Hot take: the real value here is not the benchmark screenshots; it’s the operating playbook for making a long-context open model behave on constrained local hardware.
- The configuration is the core contribution: `--no-context-shift`, `--kv-unified`, `--cache-ram 0`, and `-b/-ub 1024` are the knobs that matter most for stability and throughput.
- The post is useful because it separates what is necessary from what is optional, including the author’s warning that `--cache-reuse 256` can help or hurt depending on workload.
- The hardware angle is narrow but valuable: Strix Halo plus aggressive tuning makes 100k context feel like a reproducible local setup instead of a lab demo.
- The model comparison is nuanced rather than hype-driven: MiniMax is framed as better at “intent” and coding intuition, while Qwen3.6 27B still wins on broader reasoning and review quality.
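The swap and OOM notes are the other half of the playbook. Below is a minimal sketch of that tuning for a headless Fedora host, assuming a Btrfs root (Fedora’s default), a 32G swapfile, and a systemd unit named `llama-server.service`; all three are illustrative choices, not the author’s values.

```bash
# Hypothetical swap/OOM tuning; sizes and the unit name are placeholders.
# On Btrfs the swapfile must be created without copy-on-write:
sudo btrfs filesystem mkswapfile --size 32g /swapfile
sudo swapon /swapfile
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

# Bias the kernel toward keeping model pages resident.
sudo sysctl vm.swappiness=10

# Make the server a low-priority target for the kernel OOM killer:
sudo systemctl edit llama-server.service
#   [Service]
#   OOMScoreAdjust=-500
```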
DISCOVERED: 2026-05-10 (4h ago)
PUBLISHED: 2026-05-09 (7h ago)
AUTHOR: Zc5Gwu