Dual Spark owners probe llama.cpp scaling
A Reddit user who already runs vLLM on a pair of ASUS GX10 (Spark) machines asks whether llama.cpp can be used the same way for a MiniMax model that is available only as GGUF and does not fit on a single machine. The post is a practical request for distributed-inference guidance: the target model is `llmfan46/MiniMax-M2.7-ultra-uncensored-heretic-GGUF`, and the core question is whether two Spark boxes can be combined under llama.cpp.
Hot take: this is less a “how do I launch it?” question and more a “which llama.cpp distribution path is actually viable here?” question.
- Upstream llama.cpp does support multi-GPU on one host, and its docs cover both `layer` and experimental `tensor` split modes.
- llama.cpp also has RPC-based distributed inference across remote hosts, but the RPC backend is explicitly described as proof-of-concept and fragile/insecure.
- The model matters: llama.cpp’s own multi-GPU docs say `tensor` split is not implemented for `MiniMax-M2`, so the obvious “just use tensor parallelism” path is blocked for this architecture.
- For dual Spark hardware, the realistic paths are likely layer-splitting on a single host, or RPC offload across nodes if the user is willing to accept the experimental tradeoffs.
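As a rough illustration of the RPC path described above, here is a minimal Python sketch that only assembles the two command lines involved: one for the worker box and one for the box that loads the model. It assumes a llama.cpp build with the RPC backend enabled (`-DGGML_RPC=ON`) and uses the `rpc-server` and `llama-server` flags as the upstream RPC README describes them; check them against `--help` on your own build. The IP address, port, and model filename are hypothetical placeholders, not values from the post.

```python
import shlex


def rpc_worker_cmd(host: str = "0.0.0.0", port: int = 50052) -> list[str]:
    """Build the argv for llama.cpp's rpc-server on the second Spark.

    Binding to a non-loopback address exposes an unauthenticated service,
    so this is only reasonable on a trusted private link between the boxes.
    """
    return ["rpc-server", "--host", host, "--port", str(port)]


def rpc_client_cmd(model_path: str, workers: list[str],
                   n_gpu_layers: int = 99) -> list[str]:
    """Build the argv for llama-server on the first Spark.

    `workers` is a list of "host:port" strings; llama.cpp takes them as a
    single comma-separated value after --rpc.
    """
    if not workers:
        raise ValueError("at least one host:port RPC worker is required")
    return [
        "llama-server",
        "-m", model_path,
        "--rpc", ",".join(workers),
        "-ngl", str(n_gpu_layers),  # offload as many layers as possible
    ]


if __name__ == "__main__":
    # Hypothetical address and filename, for illustration only.
    print(shlex.join(rpc_worker_cmd()))
    print(shlex.join(rpc_client_cmd("minimax-m2.7.gguf", ["10.0.0.2:50052"])))
```

The sketch deliberately stops at printing the commands rather than launching them, since whether this particular GGUF loads and runs acceptably over RPC is exactly the open question in the post.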
DISCOVERED: 2026-05-21
PUBLISHED: 2026-05-21
AUTHOR: koibKop4