OPEN_SOURCE
REDDIT · 10h ago · BENCHMARK RESULT
llama.cpp benchmarks jump on RTX 6000 Server
Rebuilding llama.cpp from b8198 to d05fe1d on the same RTX 6000 Server setup produced large gains on Qwen3.5-122B-A10B MXFP4_MOE, especially in prompt processing and long-context TTFT (time to first token). The uplift appears to come from newer Blackwell/NVFP4 paths, prompt caching, and MoE/MXFP4 kernel work rather than any hardware or model change.
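For context, an A/B comparison like this can be scripted. The following is a minimal sketch, not the poster's actual harness: it assumes a local llama.cpp clone and a CUDA toolchain, the model filename is hypothetical, and the build and bench invocations are the standard ones from the llama.cpp README (cmake with GGML_CUDA, llama-bench with -p/-n and JSON output).

```python
#!/usr/bin/env python3
"""Hypothetical A/B harness: rebuild llama.cpp at two revisions and
benchmark each with llama-bench. The revisions are the ones from the
post; paths and the model filename are illustrative."""
import subprocess
from pathlib import Path

REPO = Path("llama.cpp")                     # assumed local clone
MODEL = "qwen3.5-122b-a10b-mxfp4_moe.gguf"   # hypothetical filename
REVISIONS = {"old": "b8198", "new": "d05fe1d"}

def run(cmd, **kw):
    print("+", " ".join(map(str, cmd)))
    subprocess.run(cmd, check=True, **kw)

for label, rev in REVISIONS.items():
    run(["git", "-C", str(REPO), "checkout", rev])
    # Standard CUDA build per the llama.cpp README; one build dir per
    # revision to avoid stale cmake caches.
    build = f"build-{label}"
    run(["cmake", "-B", build, "-DGGML_CUDA=ON"], cwd=REPO)
    run(["cmake", "--build", build, "--config", "Release", "-j"], cwd=REPO)
    # pp512/tg128 run, JSON output saved for later comparison.
    with open(f"bench-{label}.json", "w") as out:
        run([REPO / build / "bin" / "llama-bench",
             "-m", MODEL, "-p", "512", "-n", "128", "-o", "json"],
            stdout=out)
```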
// ANALYSIS
This is the kind of infra uplift that matters more than a flashy model swap: the same box is simply doing more useful work per second.
- Prompt processing improved the most, with pp512 up 45% and 8K-65K prompt throughput up 59-69%, pointing to kernel and tensor-core path gains (see the delta sketch after this list).
- Decode throughput also moved meaningfully, though less dramatically, rising 28-35% at depth while keeping the slowdown curve roughly intact.
- Per-request throughput at concurrency 2 jumped 60%, which suggests batching and slot scheduling improved, not just single-stream latency.
- TTFT fell 23-32% at long context lengths, so the new build is materially better for interactive use, not just benchmark charts.
- Speculative decoding is present in the build, but the reported draft-model test was net-negative, so the practical win here is the core inference path.
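The percent figures above are plain throughput deltas between the two builds. A small sketch of that arithmetic, reading the two JSON files produced by the harness above; the row keys (n_prompt, n_gen, avg_ts) assume llama-bench's JSON output format, so adjust them if your build emits different fields:

```python
#!/usr/bin/env python3
"""Compare two llama-bench JSON runs and print per-config gains."""
import json

def load(path):
    """Index average tokens/s by (n_prompt, n_gen) benchmark config."""
    with open(path) as f:
        rows = json.load(f)
    return {(r["n_prompt"], r["n_gen"]): r["avg_ts"] for r in rows}

old = load("bench-old.json")
new = load("bench-new.json")
for cfg in sorted(old):
    if cfg not in new:
        continue
    gain = (new[cfg] - old[cfg]) / old[cfg] * 100
    # llama-bench naming: pp512 = 512-token prompt, tg128 = 128-token gen
    name = f"pp{cfg[0]}" if cfg[1] == 0 else f"tg{cfg[1]}"
    print(f"{name:>8}: {old[cfg]:8.1f} -> {new[cfg]:8.1f} t/s ({gain:+.1f}%)")
```

Run against the two builds, this would print each config's tokens/s with a signed percent change, the same arithmetic behind the post's pp512 +45% figure.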
// TAGS
llama-cpp · benchmark · inference · gpu · moe · quantization · open-source · llm
DISCOVERED
10h ago (2026-05-03)
PUBLISHED
11h ago (2026-05-02)
RELEVANCE
8/10
AUTHOR
laziz