OPEN_SOURCE
REDDIT // 32d ago · BENCHMARK RESULT
rolv touts 20.7x Llama 4 speedup
rolv says its rolvsparse library beat cuBLAS on a real Llama 4 Maverick MoE expert weight pulled from Hugging Face, pushing throughput from 369K to 7.66M tokens/s on an NVIDIA B200 while cutting time to first token from 64.8ms to 0.37ms. The company’s pitch is that it can skip provably zero compute in sparse expert projections without changing outputs, turning MoE latency and energy efficiency into an infrastructure advantage rather than a model-quality tradeoff.
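The core claim is that compute which is provably zero can be skipped without changing outputs. A minimal sketch of that idea (illustrative only, not ROLV's actual kernel or API): in an expert projection y = x @ W, any all-zero row of W contributes nothing to y, so its multiply-accumulates can be dropped and the result stays bit-for-bit equivalent up to float accumulation order.

```python
import numpy as np

def dense_projection(x, W):
    # Baseline: full dense matmul, including work on zero rows.
    return x @ W

def zero_skipping_projection(x, W):
    # Keep only rows of W with at least one nonzero entry, and the
    # matching columns of x; the dropped work is provably zero.
    nz_rows = np.any(W != 0, axis=1)
    return x[:, nz_rows] @ W[nz_rows, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))
W[rng.random(512) < 0.7] = 0.0      # ~70% of expert rows exactly zero
x = rng.standard_normal((4, 512))

# Same output, a fraction of the FLOPs.
assert np.allclose(dense_projection(x, W), zero_skipping_projection(x, W))
```

Real MoE expert weights are rarely this cleanly row-sparse; the benchmark's premise is that the Llama 4 Maverick tensor in question has enough exploitable structure for a specialized kernel to beat dense cuBLAS.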
// ANALYSIS
If these numbers hold up outside vendor-controlled benchmarks, ROLV is attacking one of the most valuable choke points in modern inference: first-token latency on giant MoE models.
- The most important claim is not raw tokens per second but the 177x TTFT reduction, because that is what users actually feel in interactive inference.
- The benchmark is more credible than a toy sparse-matrix demo because it uses a real Llama 4 Maverick weight tensor and publishes matching figures on the company site, not just a synthetic workload.
- ROLV is positioning itself as infrastructure middleware, not a new model stack: same hardware, same math, lower compute waste.
- The obvious caveat is that this is still a company-published benchmark, so buyers will want independent reproduction on broader end-to-end serving workloads, not just isolated matrix kernels.
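The headline multipliers follow directly from the quoted figures, so they are at least internally consistent. From the rounded numbers, throughput improves ~20.8x and TTFT ~175x; the 177x figure is consistent with this once you allow for rounding in the quoted millisecond values.

```python
# Sanity-check the headline ratios from the figures quoted above.
throughput_gain = 7.66e6 / 369e3   # 369K -> 7.66M tokens/s on B200
ttft_gain = 64.8 / 0.37            # 64.8ms -> 0.37ms time to first token

print(f"throughput: {throughput_gain:.1f}x")  # ~20.8x
print(f"ttft: {ttft_gain:.0f}x")              # ~175x
```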
// TAGS
rolv · llm · inference · gpu · benchmark
DISCOVERED
32d ago
2026-03-11
PUBLISHED
33d ago
2026-03-09
RELEVANCE
8/10
AUTHOR
Norwayfund