Thunderbolt eGPU slows MiniMax inference
OPEN_SOURCE · REDDIT · 4h ago · BENCHMARK RESULT

A LocalLLaMA user found that adding an RTX 3060 over Thunderbolt to a dual RTX 3090 MiniMax setup made inference worse, dropping generation from 25.19 to 24.35 tokens/sec and prompt processing from 30.37 to 20.70 tokens/sec. The result underlines a hard local-inference reality: extra VRAM is not automatically useful when the interconnect becomes the bottleneck.
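The asymmetry in those numbers is the whole story. Computing the relative drops from the figures reported in the post (the percentages are derived here, not quoted from it):

```python
# Reported MiniMax throughput before/after adding the Thunderbolt RTX 3060
# (raw tokens/sec figures from the post; percentage math is ours).
gen_before, gen_after = 25.19, 24.35   # generation (decode)
pp_before, pp_after = 30.37, 20.70     # prompt processing (prefill)

gen_drop = (gen_before - gen_after) / gen_before * 100
pp_drop = (pp_before - pp_after) / pp_before * 100

print(f"generation slowdown:        {gen_drop:.1f}%")   # ~3.3%
print(f"prompt-processing slowdown: {pp_drop:.1f}%")    # ~31.8%
```

Generation barely moves, but prefill loses nearly a third of its throughput once the slow link enters the topology.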

// ANALYSIS

This is a useful anti-benchmark for home AI rigs: the weakest link is not always raw GPU compute; it is how often the runtime has to move activations, KV cache, and layer data across slow links.

  • Thunderbolt eGPU bandwidth and latency can erase the benefit of moving a small model slice out of system RAM.
  • Prompt processing suffers more than generation because prefill stresses memory movement and inter-GPU coordination much harder than token-by-token decoding does.
  • PCIe x1 multi-GPU setups may still work for mostly sequential layer offload, but they are risky for large-context, multi-GPU llama.cpp workloads.
  • For local MiniMax-class MoE inference, more system RAM or fewer faster-connected GPUs may beat a larger pile of lane-starved cards.
  • The broader lesson for builders is to benchmark topology, not just VRAM totals.
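A back-of-envelope model makes the prefill/decode asymmetry concrete. The sketch below uses entirely hypothetical numbers (assumed link bandwidths, an assumed model width and layer-split, and it ignores link latency and KV-cache traffic); it only illustrates why per-step transfer cost scales with prompt length during prefill but not during decode:

```python
# Back-of-envelope: time to shuttle activations across an interconnect.
# All figures are illustrative assumptions, not measurements from the post:
#   PCIe 4.0 x16 ~ 25 GB/s effective, Thunderbolt ~ 2.5 GB/s effective.
HIDDEN, FP16_BYTES = 6144, 2   # hypothetical model width, bytes per element
CROSSINGS = 8                  # assumed layer-boundary hops to/from the eGPU

def transfer_ms(nbytes: float, gb_per_s: float) -> float:
    """Milliseconds to move nbytes over a link of gb_per_s GB/s."""
    return nbytes / (gb_per_s * 1e9) * 1e3

def activation_bytes(tokens: int) -> int:
    """Activation traffic for `tokens` tokens crossing the split CROSSINGS times."""
    return tokens * HIDDEN * FP16_BYTES * CROSSINGS

for link, bw in [("PCIe 4.0 x16", 25.0), ("Thunderbolt", 2.5)]:
    decode = transfer_ms(activation_bytes(1), bw)      # decode: 1 token/step
    prefill = transfer_ms(activation_bytes(4096), bw)  # prefill: whole prompt
    print(f"{link:13s} decode {decode:7.4f} ms/step | 4k prefill {prefill:7.1f} ms")
```

Under these assumptions, decode pays a fraction of a millisecond per step even over Thunderbolt, while a 4k-token prefill pays on the order of 100+ ms of pure transfer time on the slow link, consistent with the post's pattern of prefill degrading far more than generation.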
// TAGS
minimax-m2 · llm · inference · gpu · self-hosted · benchmark

DISCOVERED · 4h ago (2026-04-21)

PUBLISHED · 6h ago (2026-04-21)

RELEVANCE · 7/10

AUTHOR · SnooPaintings8639