OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
Thunderbolt eGPU slows MiniMax inference
A LocalLLaMA user found that adding an RTX 3060 over Thunderbolt to a dual RTX 3090 MiniMax setup made inference worse, dropping generation from 25.19 to 24.35 tokens/sec and prompt processing from 30.37 to 20.70 tokens/sec. The result underlines a hard local-inference reality: extra VRAM is not automatically useful when the interconnect becomes the bottleneck.
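The reported drop is easy to quantify. A minimal sketch computing the relative slowdown from the post's numbers:

```python
# Percentage slowdown from the reported benchmark numbers
# (dual RTX 3090 baseline vs. baseline + Thunderbolt RTX 3060).

def slowdown_pct(before: float, after: float) -> float:
    """Relative throughput loss in percent."""
    return (before - after) / before * 100

gen = slowdown_pct(25.19, 24.35)      # generation tokens/sec
prefill = slowdown_pct(30.37, 20.70)  # prompt-processing tokens/sec

print(f"generation: {gen:.1f}% slower")   # ~3.3%
print(f"prefill:    {prefill:.1f}% slower")  # ~31.8%
```

Generation loses only ~3%, but prompt processing loses nearly a third of its throughput, which is why the prefill-heavy numbers are the story here.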
// ANALYSIS
This is a useful anti-benchmark for home AI rigs: the weakest link is not always raw GPU compute but how often the runtime has to move activations, KV-cache entries, and layer weights across slow links.
- Thunderbolt eGPU bandwidth and latency can erase the benefit of moving a small model slice out of system RAM.
- Prompt processing suffers more than generation because prefill stresses memory movement and inter-GPU coordination far more heavily.
- PCIe x1 multi-GPU setups may still work for mostly sequential layer offload, but they are risky for large-context, multi-GPU llama.cpp workloads.
- For local MiniMax-class MoE inference, more system RAM or fewer faster-connected GPUs may beat a larger pile of lane-starved cards.
- The broader lesson for builders is to benchmark topology, not just VRAM totals.
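A back-of-the-envelope model shows why the link dominates. This sketch assumes illustrative figures only (the per-token transfer size and usable bandwidths are assumptions, not measurements from the post):

```python
def effective_tps(base_tps: float, bytes_per_token: float, link_gbps: float) -> float:
    """Throughput after adding a fixed per-token transfer over a link.

    base_tps: tokens/sec with no cross-link traffic
    bytes_per_token: bytes crossing the link per token (assumed)
    link_gbps: usable link bandwidth in GB/s (assumed)
    """
    per_token_s = 1.0 / base_tps + bytes_per_token / (link_gbps * 1e9)
    return 1.0 / per_token_s

BASE = 25.0   # tokens/sec, roughly the dual-3090 baseline
XFER = 50e6   # 50 MB crossing the link per token -- purely illustrative

print(f"PCIe 4.0 x16 (~25 GB/s):          {effective_tps(BASE, XFER, 25):.1f} tok/s")
print(f"Thunderbolt 3 (~2.5 GB/s usable): {effective_tps(BASE, XFER, 2.5):.1f} tok/s")
```

With the same assumed traffic, the fast link costs a few percent while the slow link cuts throughput by a third, mirroring the shape of the reported result.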
// TAGS
minimax-m2 · llm · inference · gpu · self-hosted · benchmark
DISCOVERED
4h ago
2026-04-21
PUBLISHED
6h ago
2026-04-21
RELEVANCE
7/10
AUTHOR
SnooPaintings8639