Thunderbolt eGPU slows MiniMax inference

// 45d ago · BENCHMARK RESULT

A LocalLLaMA user found that adding an RTX 3060 over Thunderbolt to a dual RTX 3090 MiniMax setup made inference worse, dropping generation from 25.19 to 24.35 tokens/sec and prompt processing from 30.37 to 20.70 tokens/sec. The result underlines a hard local-inference reality: extra VRAM is not automatically useful when the interconnect becomes the bottleneck.
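To put the two slowdowns on the same scale, the reported throughput figures can be turned into relative changes. The following is a minimal sketch; the helper name `pct_change` is illustrative and not from the original post, and the numbers are the ones quoted above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Tokens/sec as reported: (dual RTX 3090 only, with the RTX 3060 eGPU added)
results = {
    "generation": (25.19, 24.35),
    "prompt processing": (30.37, 20.70),
}

for phase, (before, after) in results.items():
    change = pct_change(before, after)
    print(f"{phase}: {before:.2f} -> {after:.2f} tok/s ({change:+.1f}%)")
# generation: 25.19 -> 24.35 tok/s (-3.3%)
# prompt processing: 30.37 -> 20.70 tok/s (-31.8%)
```

Generation loses roughly 3%, while prompt processing loses close to a third of its throughput.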

// ANALYSIS

This is a useful anti-benchmark for home AI rigs: the weakest link is not always raw GPU compute. It can be how often the runtime has to move activations, KV cache, and layer data across slow links.

  • Thunderbolt eGPU bandwidth and latency can erase the benefit of moving a small model slice out of system RAM.
  • Prompt processing suffers more than generation (roughly a 32% drop versus roughly 3% here) because prefill handles the whole prompt in large batches, which stresses memory movement and inter-GPU coordination harder.
  • PCIe x1 multi-GPU setups may still work for mostly sequential layer offload, but they are risky for large-context, multi-GPU llama.cpp workloads.
  • For local MiniMax-class MoE inference, more system RAM or fewer faster-connected GPUs may beat a larger pile of lane-starved cards.
  • The broader lesson for builders is to benchmark topology, not just VRAM totals.
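
As a rough way to reason about topology before buying hardware, per-transfer cost can be modeled as fixed latency plus payload size divided by link bandwidth. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement of the setup in the post: the PCIe figures are nominal PCIe 3.0 rates, while the Thunderbolt throughput, the latency values, and the payload size are hypothetical placeholders that should be replaced with numbers measured on your own system.

```python
def transfer_ms(payload_bytes: float, bandwidth_gb_s: float, latency_us: float = 0.0) -> float:
    """Estimated time in milliseconds to move one payload across a link."""
    return latency_us / 1000.0 + payload_bytes / (bandwidth_gb_s * 1e9) * 1000.0


# (bandwidth in GB/s, latency in microseconds)
# PCIe 3.0 rates are nominal; the Thunderbolt row and all latencies are ASSUMED.
LINKS = {
    "PCIe 3.0 x16": (15.75, 5.0),
    "PCIe 3.0 x4": (3.94, 5.0),
    "PCIe 3.0 x1": (0.985, 5.0),
    "Thunderbolt eGPU (assumed)": (2.5, 50.0),
}

# Hypothetical payload: 64 MiB of activations or KV cache per transfer.
PAYLOAD_BYTES = 64 * 1024 * 1024

for name, (bandwidth, latency) in LINKS.items():
    ms = transfer_ms(PAYLOAD_BYTES, bandwidth, latency)
    print(f"{name}: {ms:.2f} ms per transfer")
```

A model like this only ranks links against each other. Actual inference throughput depends on how the runtime splits layers and how often it synchronizes, so it does not replace running the benchmark on the real topology.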
// TAGS
minimax-m2-7, minimax-m2-5, llm, inference, gpu, self-hosted, benchmark

DISCOVERED

45d ago

2026-04-21

PUBLISHED

45d ago

2026-04-21

RELEVANCE

7/10

AUTHOR

SnooPaintings8639