OPEN_SOURCE
REDDIT · 10h ago · BENCHMARK RESULT
llama.cpp benchmarks jump on RTX 6000 Server
Rebuilding llama.cpp from b8198 to d05fe1d on the same RTX 6000 Server setup produced large gains on Qwen3.5-122B-A10B MXFP4_MOE, especially in prompt processing and long-context TTFT (time to first token). The uplift appears to come from newer Blackwell/NVFP4 paths, prompt caching, and MoE/MXFP4 kernel work rather than any hardware or model change.
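For context, an A/B comparison like this can be scripted. The following is a minimal sketch, not the poster's actual harness: it assumes a local llama.cpp clone and a CUDA toolchain, the model filename is hypothetical, and the build and bench invocations are the standard ones from the llama.cpp README (cmake with GGML_CUDA, llama-bench with -p/-n and JSON output).

```python
#!/usr/bin/env python3
"""Hypothetical A/B harness: rebuild llama.cpp at two revisions and
benchmark each with llama-bench. The revisions are the ones from the
post; paths and the model filename are illustrative."""
import subprocess
from pathlib import Path

REPO = Path("llama.cpp")                     # assumed local clone
MODEL = "qwen3.5-122b-a10b-mxfp4_moe.gguf"   # hypothetical filename
REVISIONS = {"old": "b8198", "new": "d05fe1d"}

def run(cmd, **kw):
    print("+", " ".join(map(str, cmd)))
    subprocess.run(cmd, check=True, **kw)

for label, rev in REVISIONS.items():
    run(["git", "-C", str(REPO), "checkout", rev])
    # Standard CUDA build per the llama.cpp README; one build dir per
    # revision to avoid stale cmake caches.
    build = f"build-{label}"
    run(["cmake", "-B", build, "-DGGML_CUDA=ON"], cwd=REPO)
    run(["cmake", "--build", build, "--config", "Release", "-j"], cwd=REPO)
    # pp512/tg128 run, JSON output saved for later comparison.
    with open(f"bench-{label}.json", "w") as out:
        run([REPO / build / "bin" / "llama-bench",
             "-m", MODEL, "-p", "512", "-n", "128", "-o", "json"],
            stdout=out)
```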
// ANALYSIS
This is the kind of infra uplift that matters more than a flashy model swap: the same box is simply doing more useful work per second.
- Prompt processing improved the most, with pp512 up 45% and 8K-65K prompt throughput up 59-69%, pointing to kernel and tensor-core path gains (see the delta sketch after this list).
- Decode throughput also moved meaningfully, though less dramatically, rising 28-35% at depth while keeping the slowdown curve roughly intact.
- Per-request throughput at concurrency 2 jumped 60%, which suggests batching and slot scheduling improved, not just single-stream latency.
- TTFT fell 23-32% at long context lengths, so the new build is materially better for interactive use, not just benchmark charts.
- Speculative decoding is present in the build, but the reported draft-model test was net-negative, so the practical win here is the core inference path.
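The percent figures above are plain throughput deltas between the two builds. A small sketch of that arithmetic, reading the two JSON files produced by the harness above; the row keys (n_prompt, n_gen, avg_ts) assume llama-bench's JSON output format, so adjust them if your build emits different fields:

```python
#!/usr/bin/env python3
"""Compare two llama-bench JSON runs and print per-config gains."""
import json

def load(path):
    """Index average tokens/s by (n_prompt, n_gen) benchmark config."""
    with open(path) as f:
        rows = json.load(f)
    return {(r["n_prompt"], r["n_gen"]): r["avg_ts"] for r in rows}

old = load("bench-old.json")
new = load("bench-new.json")
for cfg in sorted(old):
    if cfg not in new:
        continue
    gain = (new[cfg] - old[cfg]) / old[cfg] * 100
    # llama-bench naming: pp512 = 512-token prompt, tg128 = 128-token gen
    name = f"pp{cfg[0]}" if cfg[1] == 0 else f"tg{cfg[1]}"
    print(f"{name:>8}: {old[cfg]:8.1f} -> {new[cfg]:8.1f} t/s ({gain:+.1f}%)")
```

Run against the two builds, this would print each config's tokens/s with a signed percent change, the same arithmetic behind the post's pp512 +45% figure.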
// TAGS
llama-cpp · benchmark · inference · gpu · moe · quantization · open-source · llm
DISCOVERED
10h ago (2026-05-03)
PUBLISHED
11h ago (2026-05-02)
RELEVANCE
8/10
AUTHOR
laziz