OPEN_SOURCE · REDDIT · BENCHMARK RESULT · 10h ago

llama.cpp benchmarks jump on RTX 6000 Server

Rebuilding llama.cpp from b8198 to d05fe1d on the same RTX 6000 Server setup produced large gains on Qwen3.5-122B-A10B MXFP4_MOE, especially in prompt processing and long-context TTFT (time to first token). The uplift appears to come from newer Blackwell/NVFP4 code paths, prompt caching, and MoE/MXFP4 kernel work rather than from any hardware or model change.
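
For anyone wanting to reproduce this kind of A/B comparison, here is a minimal harness sketch. It assumes a local llama.cpp checkout, a CUDA build, and a placeholder model filename; llama-bench's -m/-p/-n/-o flags are real, but verify the exact set against --help on your build.

    # A/B rebuild harness (sketch): check out each ref, rebuild, and run the
    # same llama-bench case. REPO, MODEL, and the CMake flags are assumptions.
    import subprocess
    from pathlib import Path

    REPO = Path("llama.cpp")                      # local checkout (assumed)
    MODEL = "Qwen3.5-122B-A10B-MXFP4_MOE.gguf"    # placeholder filename
    BENCH = REPO / "build" / "bin" / "llama-bench"

    def run(cmd, **kwargs):
        subprocess.run(cmd, check=True, **kwargs)

    for ref in ("b8198", "d05fe1d"):              # the two builds compared
        run(["git", "-C", str(REPO), "checkout", ref])
        run(["cmake", "-B", "build", "-DGGML_CUDA=ON"], cwd=REPO)
        run(["cmake", "--build", "build", "-j"], cwd=REPO)
        # pp512/tg128: 512-token prompt processing, 128-token generation,
        # JSON output so the two refs' results can be diffed afterwards
        run([str(BENCH), "-m", MODEL, "-p", "512", "-n", "128", "-o", "json"])

Running the same single case on both refs, same GPU and model file, isolates the software change, which is exactly the comparison the post describes.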

// ANALYSIS

This is the kind of infra uplift that matters more than a flashy model swap: the same box is simply doing more useful work per second.

  • Prompt processing improved the most, with pp512 up 45% and 8K-65K prompt throughput up 59-69%, pointing to kernel and tensor-core path gains (see the sketch after this list for how such percentages are computed).
  • Decode throughput rose 28-35% at nonzero context depth, a smaller but still meaningful gain, with the depth-related slowdown curve staying roughly intact.
  • Concurrency-2 per-request throughput jumped 60%, which suggests batching and slot scheduling are better than before, not just single-stream latency wins.
  • TTFT fell 23-32% at long context lengths, so the new build is materially better for interactive use, not just benchmark charts.
  • Speculative decode is present in the build, but the reported draft-model test was net-negative, so the practical win here is the core inference path.
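
To keep the percentage claims auditable, this is the arithmetic behind figures like the ones above, with made-up placeholder readings rather than the post's raw numbers:

    # Relative-change math behind the bullet figures. The sample values are
    # illustrative placeholders, not data from the post.
    def pct_change(before: float, after: float) -> float:
        return (after - before) / before * 100.0

    # Throughput (tokens/s): higher is better, so positive change = speedup.
    print(f"pp512: {pct_change(1000.0, 1450.0):+.0f}%")   # +45%

    # TTFT (seconds to first token): lower is better, so the improvement
    # shows up as a negative change.
    print(f"TTFT:  {pct_change(10.0, 7.2):+.0f}%")        # -28%

Note the sign convention: a throughput gain is a positive percentage, while a TTFT improvement is a drop, which is why the post reports those two sets of numbers differently.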
// TAGS
llama-cpp · benchmark · inference · gpu · moe · quantization · open-source · llm

DISCOVERED
10h ago (2026-05-03)

PUBLISHED
11h ago (2026-05-02)

RELEVANCE
8/10

AUTHOR
laziz