OPEN_SOURCE
REDDIT · 2h ago · BENCHMARK RESULT
RWKU batch-size swing flags eval bug
A Reddit post reports that Llama 3.2 1B Instruct scores about 47.3 on RWKU utility_general at batch size 1 but drops to 29.7 at batch size 4, with a similar collapse on utility_reason in a 3-shot setup. Because accuracy on a static benchmark should not depend materially on batch size, the swing points to a batching, padding, masking, truncation, or result-alignment bug in the evaluation harness rather than a genuine model-quality problem.
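One concrete way batching silently corrupts causal-LM scores is padding. With left padding (the usual setup for batched generation), position ids must be derived from the attention mask rather than from raw token indices, or every padded sequence is evaluated at shifted positions. A minimal plain-Python sketch of the standard fix (the function name is illustrative, not from any specific harness):

```python
def position_ids_from_mask(attention_mask):
    """Compute per-token position ids from a 0/1 attention mask.

    Real tokens get positions 0, 1, 2, ... counted over non-pad tokens only;
    pad positions get a dummy id of 0 (they are masked out of attention anyway).
    Equivalent to the common tensor idiom `mask.cumsum(-1) - 1`, clamped at 0.
    """
    ids = []
    for row in attention_mask:
        running = 0
        row_ids = []
        for m in row:
            if m:
                row_ids.append(running)
                running += 1
            else:
                row_ids.append(0)  # pad slot: dummy position
        ids.append(row_ids)
    return ids


# A left-padded batch: sequence 1 has two pad tokens, sequence 2 has none.
mask = [[0, 0, 1, 1],
        [1, 1, 1, 1]]
print(position_ids_from_mask(mask))  # [[0, 0, 0, 1], [0, 1, 2, 3]]
```

At batch size 1 there is no padding, so a harness that skips this step still scores correctly, which matches the reported pattern of a collapse only at larger batch sizes.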
// ANALYSIS
Hot take: a large batch-size swing on a static multiple-choice benchmark is almost always an implementation bug.
- Causal LMs can break under batching if padding side, attention masks, or position ids are handled incorrectly.
- Batched generation can mis-score if outputs are mapped back to examples by the wrong index after collation or sorting.
- Prompt truncation, stopping criteria, or tokenization differences between single-item and multi-item batches can significantly change exact-match metrics.
- The fact that both utility_general and utility_reason fall off points to a shared eval-path problem, not a dataset-specific weakness.
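The second failure mode above, wrong index mapping after sorting, is easy to reproduce. Many harnesses sort prompts by length to minimize padding per batch; scores must then be written back to the original example order before comparison with gold answers. A hypothetical sketch (names and structure are illustrative, not the RWKU eval code):

```python
def score_in_batches(prompts, gold, batch_size, predict_fn):
    """Batch prompts sorted by length, then restore original order.

    Forgetting the write-back step below is exactly the kind of bug that
    leaves batch_size=1 accuracy intact while larger batches collapse.
    """
    # Sort indices by prompt length so each batch needs minimal padding.
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    preds = [None] * len(prompts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch_preds = predict_fn([prompts[i] for i in idx])
        # Crucial step: map each prediction back to its ORIGINAL position.
        for i, p in zip(idx, batch_preds):
            preds[i] = p
    return [p == g for p, g in zip(preds, gold)]


# Toy "model": predict the first character of each prompt.
prompts = ["aa", "b", "cccc", "dd"]
gold = ["a", "b", "c", "d"]
fn = lambda batch: [p[0] for p in batch]
print(all(score_in_batches(prompts, gold, 1, fn)))  # True
print(all(score_in_batches(prompts, gold, 4, fn)))  # True
```

Dropping the write-back loop and zipping `batch_preds` against the sorted prompts' positions in submission order would make batch size 1 trivially correct and batch size 4 wrong, reproducing the reported asymmetry.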
// TAGS
rwku · llama-3.2 · benchmarking · evaluation · batching · llm-inference · unlearning
DISCOVERED
2h ago
2026-04-20
PUBLISHED
4h ago
2026-04-20
RELEVANCE
7/10
AUTHOR
SwimmingMedical6693