
DGX Spark + M3 Ultra: splitting prefill and decode
This Reddit report tests exo-style disaggregated inference across an NVIDIA DGX Spark and an Apple M3 Ultra Mac Studio, pushing prefill to the Spark and decode to the Mac. The author reports 1.4x to 3.4x speedups on prefill-heavy llama.cpp runs and notes that setting mmap=0 materially improves load time and throughput.
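To make the placement argument concrete, here is a minimal NumPy sketch, not the author's code: shapes, sizes, and the hand-off format are invented. It shows why the two phases stress different resources: prefill is one large matmul over all prompt tokens (compute-heavy), while each decode step does little arithmetic but re-reads the entire KV cache (bandwidth-heavy).

```python
# Toy illustration of the prefill/decode split; dimensions are made up.
import numpy as np

d_model, n_prompt, n_decode = 1024, 512, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model)).astype(np.float32)

# --- Prefill: all prompt tokens at once -> one large matmul.
# High arithmetic intensity, so a compute-rich GPU (the Spark) wins.
prompt = rng.standard_normal((n_prompt, d_model)).astype(np.float32)
kv_cache = prompt @ W              # stands in for building K/V for every token

# --- Hand-off: in a disaggregated setup, this serialized KV cache is
# what gets shipped from the prefill node to the decode node.
payload = kv_cache.tobytes()       # ~ n_prompt * d_model * 4 bytes on the wire

# --- Decode: one token per step, but each step re-reads the whole cache.
# Work per step is tiny; traffic is large, so memory bandwidth (the
# M3 Ultra's strength) dominates.
kv = np.frombuffer(payload, dtype=np.float32).reshape(n_prompt, d_model)
tok = rng.standard_normal((1, d_model)).astype(np.float32)
for _ in range(n_decode):
    scores = tok @ kv.T            # attention read over the full cache
    tok = scores @ kv / n_prompt   # toy stand-in for the next-token state
```

The `payload` line is the piece that motivates KV streaming: the cost of moving the cache between boxes is what the split has to amortize.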
Strong benchmark post with a practical systems angle: this is less about a single magic box and more about workload placement. The headline insight is valid: prefill is compute-bound and decode is bandwidth-bound, so splitting them across the right hardware makes sense. The reported wins for prefill on the Spark are meaningful, but they are anecdotal and workload-dependent and should not be read as universal uplift. The mmap=0 tip is the kind of detail that matters for real users because it can dominate load latency and even prefill behavior. The post is most relevant to people already running local LLMs, especially those experimenting with llama.cpp, KV streaming, or mixed CUDA + Apple Silicon setups. This is not an end-user product story; it reads more like an infrastructure benchmark note plus a promising DIY architecture pattern.
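For anyone wanting to try the mmap=0 setting locally: llama.cpp exposes it as --no-mmap on the CLI tools and as -mmp 0 in llama-bench, and llama-cpp-python exposes the equivalent use_mmap constructor flag. A minimal sketch using the Python bindings; the model path is a placeholder, not from the report:

```python
from llama_cpp import Llama

# use_mmap=False forces the weights to be read into RAM up front instead
# of being demand-paged, which is what the report's mmap=0 refers to.
llm = Llama(
    model_path="models/model.gguf",  # placeholder path
    use_mmap=False,
    n_gpu_layers=-1,                 # offload all layers; tune per machine
)
out = llm("Hello", max_tokens=8)
print(out["choices"][0]["text"])
```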
DISCOVERED: 3h ago (2026-05-04)
PUBLISHED: 4h ago (2026-05-04)
AUTHOR: -dysangel-