OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
Thunderbolt eGPU slows MiniMax inference
A LocalLLaMA user found that adding an RTX 3060 over Thunderbolt to a dual RTX 3090 MiniMax setup made inference worse, dropping generation from 25.19 to 24.35 tokens/sec and prompt processing from 30.37 to 20.70 tokens/sec. The result underlines a hard local-inference reality: extra VRAM is not automatically useful when the interconnect becomes the bottleneck.
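The reported drop is easy to quantify. A minimal sketch computing the relative slowdown from the post's numbers:

```python
# Percentage slowdown from the reported benchmark numbers
# (dual RTX 3090 baseline vs. baseline + Thunderbolt RTX 3060).

def slowdown_pct(before: float, after: float) -> float:
    """Relative throughput loss in percent."""
    return (before - after) / before * 100

gen = slowdown_pct(25.19, 24.35)      # generation tokens/sec
prefill = slowdown_pct(30.37, 20.70)  # prompt-processing tokens/sec

print(f"generation: {gen:.1f}% slower")   # ~3.3%
print(f"prefill:    {prefill:.1f}% slower")  # ~31.8%
```

Generation loses only ~3%, but prompt processing loses nearly a third of its throughput, which is why the prefill-heavy numbers are the story here.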
// ANALYSIS
This is a useful anti-benchmark for home AI rigs: the weakest link is not always raw GPU compute but how often the runtime has to move activations, KV-cache entries, and layer weights across slow links.
- Thunderbolt eGPU bandwidth and latency can erase the benefit of moving a small model slice out of system RAM.
- Prompt processing suffers more than generation because prefill stresses memory movement and inter-GPU coordination far more heavily.
- PCIe x1 multi-GPU setups may still work for mostly sequential layer offload, but they are risky for large-context, multi-GPU llama.cpp workloads.
- For local MiniMax-class MoE inference, more system RAM or fewer faster-connected GPUs may beat a larger pile of lane-starved cards.
- The broader lesson for builders is to benchmark topology, not just VRAM totals.
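A back-of-the-envelope model shows why the link dominates. This sketch assumes illustrative figures only (the per-token transfer size and usable bandwidths are assumptions, not measurements from the post):

```python
def effective_tps(base_tps: float, bytes_per_token: float, link_gbps: float) -> float:
    """Throughput after adding a fixed per-token transfer over a link.

    base_tps: tokens/sec with no cross-link traffic
    bytes_per_token: bytes crossing the link per token (assumed)
    link_gbps: usable link bandwidth in GB/s (assumed)
    """
    per_token_s = 1.0 / base_tps + bytes_per_token / (link_gbps * 1e9)
    return 1.0 / per_token_s

BASE = 25.0   # tokens/sec, roughly the dual-3090 baseline
XFER = 50e6   # 50 MB crossing the link per token -- purely illustrative

print(f"PCIe 4.0 x16 (~25 GB/s):          {effective_tps(BASE, XFER, 25):.1f} tok/s")
print(f"Thunderbolt 3 (~2.5 GB/s usable): {effective_tps(BASE, XFER, 2.5):.1f} tok/s")
```

With the same assumed traffic, the fast link costs a few percent while the slow link cuts throughput by a third, mirroring the shape of the reported result.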
// TAGS
minimax-m2 · llm · inference · gpu · self-hosted · benchmark
DISCOVERED
4h ago
2026-04-21
PUBLISHED
6h ago
2026-04-21
RELEVANCE
7/10
AUTHOR
SnooPaintings8639