OPEN_SOURCE
REDDIT // 22d ago · INFRASTRUCTURE
llama.cpp DeepSeek DSA branch seeks 768 GB of VRAM
The author is looking for access to a very large multi-GPU machine to benchmark a proof-of-concept DeepSeek Sparse Attention (DSA) branch of llama.cpp. The goal is to verify dense-vs-sparse behavior on DeepSeek V3.2 Speciale using lineage-bench, since the differences only show up on harder reasoning tasks.
// ANALYSIS
This is less a launch than a correctness hunt for an inference kernel that only really proves itself under brutal benchmark conditions.
- The 768 GB VRAM ask puts this well past normal workstation territory and into shared-cluster or proxy-runner territory.
- lineage-bench is a smart choice here because it stresses reasoning behavior, not just token throughput.
- The failed 8x RTX PRO 6000 run suggests the bottleneck is memory layout for indexer tensors, not just raw compute.
- Comparing against prior sglang fp8 runs gives a useful cross-framework sanity check of whether the sparse-attention patch is doing the right thing.
- If the sparse branch matches the expected quality deltas, that would be a strong signal that llama.cpp can support DeepSeek V3.2 Speciale without silently flattening its behavior.
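The verification logic described above can be sketched as a small comparison script. Everything here is hypothetical: the task names, scores, and the `flag_regressions` helper are illustrative stand-ins, not real lineage-bench output or part of the author's branch.

```python
# Hypothetical sketch: compare per-task scores from a dense baseline run
# against a sparse-attention run and flag any task where the sparse run
# trails by more than a tolerance. Scores and task names are made up.

DENSE = {"lineage-8": 0.96, "lineage-16": 0.91, "lineage-32": 0.78}
SPARSE = {"lineage-8": 0.95, "lineage-16": 0.90, "lineage-32": 0.62}

def flag_regressions(dense, sparse, tol=0.05):
    """Return tasks where the sparse run trails the dense run by more than tol."""
    return {
        task: round(dense[task] - sparse[task], 4)
        for task in dense
        if dense[task] - sparse[task] > tol
    }

if __name__ == "__main__":
    # Only the hardest task exceeds the tolerance in this toy data.
    print(flag_regressions(DENSE, SPARSE))
```

The point of such a check is exactly the one the post makes: small deltas on easy tasks are expected noise, while a large drop on the hardest tier would indicate the sparse kernel is flattening the model's reasoning behavior.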
// TAGS
llm · benchmark · testing · gpu · inference · deepseek · llama-cpp
DISCOVERED
2026-03-20
PUBLISHED
2026-03-20
RELEVANCE
8/10
AUTHOR
fairydreaming