OPEN_SOURCE
REDDIT · BENCHMARK RESULT
rolv clocks Llama 4 Scout attention
rolv reports that its operator now preserves the same canonical hash across all 32 self-attention QKV layers in Llama 4 Scout, extending earlier MoE-focused benchmarks to a different transformer block. On an NVIDIA B200, the post claims 4.4x to 10.4x per-iteration speedups over cuBLAS, 413 mean TFLOPS, and 86.8% mean energy savings on stacked QKV projections loaded from Hugging Face weights.
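The post does not share its harness, but the per-iteration comparison it describes can be sketched in miniature. The snippet below is a hypothetical, CPU-only illustration (toy sizes, numpy in place of cuBLAS and the accelerated kernel): it times a stacked QKV projection done as one fused GEMM against three separate Q/K/V matmuls, checking agreement before computing a speedup ratio. All names and dimensions are illustrative, not rolv's.

```python
# Hypothetical benchmark sketch -- not rolv's actual harness.
# Compares three separate Q/K/V matmuls against one stacked GEMM,
# the "stacked QKV projection" shape the post benchmarks.
import time
import numpy as np

rng = np.random.default_rng(0)
d_model, seq = 512, 128            # toy sizes, far smaller than Scout's real dims
x = rng.standard_normal((seq, d_model)).astype(np.float32)
wq, wk, wv = (rng.standard_normal((d_model, d_model)).astype(np.float32)
              for _ in range(3))
w_qkv = np.concatenate([wq, wk, wv], axis=1)   # stacked [d, 3d] weight

def separate():
    return x @ wq, x @ wk, x @ wv

def stacked():
    out = x @ w_qkv                             # one fused GEMM
    return np.split(out, 3, axis=1)

def per_iter_seconds(fn, iters=50):
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Correctness first, then timing: both paths must agree before a
# per-iteration speedup number means anything.
for a, b in zip(separate(), stacked()):
    assert np.allclose(a, b, atol=1e-3)
speedup = per_iter_seconds(separate) / per_iter_seconds(stacked)
```

On real hardware the same structure applies, with CUDA events (or equivalents) replacing `perf_counter` and the vendor kernel replacing one of the two paths.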
// ANALYSIS
This is a useful benchmark because it moves rolv from giant sparse MoE layers into a smaller, more deployment-relevant attention path where brute-force dense kernels already have less room to lose.
- The headline result is not the raw 8.3x mean speedup, but that the same canonical hash held across every one of Scout’s 32 attention layers with zero failures.
- Attention QKV projections are structurally different from the MoE FFN layers rolv highlighted before, so this broadens the claim that the method generalizes across transformer components.
- The lower gain versus the prior 55-82x MoE numbers is believable on its face because the matrix here is much smaller, making this a better stress test of whether the technique still matters outside blockbuster sparse cases.
- For AI infra readers, this lands as a benchmark story more than a product launch: promising if reproducible, but still best treated as vendor-published performance evidence rather than settled consensus.
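The "same canonical hash, zero failures" claim above has a natural operational reading: hash a canonicalized byte representation of each layer's output from both the reference path and the candidate kernel, and count mismatches over all 32 layers. The sketch below is an assumption about what such a check could look like; `canonical_hash` and the stand-in matmuls are illustrative, not rolv's code.

```python
# Illustrative per-layer hash check -- not rolv's actual verification code.
# Rounds outputs to a fixed decimal grid so numerically-close results
# canonicalize to the same bytes, then compares SHA-256 digests per layer.
import hashlib
import numpy as np

def canonical_hash(arr: np.ndarray) -> str:
    return hashlib.sha256(
        np.round(arr, 4).astype(np.float32).tobytes()
    ).hexdigest()

rng = np.random.default_rng(1)
n_layers, d = 32, 64                  # 32 mirrors Scout's attention layer count
failures = 0
for _ in range(n_layers):
    x = rng.standard_normal((16, d)).astype(np.float32)
    w = rng.standard_normal((d, 3 * d)).astype(np.float32)
    ref = x @ w                       # stand-in for the cuBLAS reference
    candidate = x @ w                 # stand-in for the accelerated kernel
    if canonical_hash(ref) != canonical_hash(candidate):
        failures += 1
# "Zero failures across all 32 layers" corresponds to failures == 0 here.
```

A check of this shape is stricter than an elementwise tolerance comparison per layer in one respect: any layer that drifts past the canonicalization grid flips the digest outright, which is why a clean sweep across 32 structurally distinct layers is the notable part of the claim.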
// TAGS
rolv · benchmark · inference · llm · gpu
DISCOVERED
2026-03-11
PUBLISHED
2026-03-10
RELEVANCE
7/10
AUTHOR
Norwayfund