REDDIT · 2h ago · BENCHMARK RESULT

RWKU batch-size swing flags eval bug

A Reddit post reports that Llama 3.2 1B Instruct scores about 47.3 on RWKU utility_general at batch size 1 but drops to 29.7 when evaluated at batch size 4, with a similar collapse on utility_reason in a 3-shot setup. Because accuracy on a static benchmark should not depend on batch size, the swing strongly suggests a batching, padding, masking, truncation, or result-alignment bug in the evaluation harness rather than a real difference in model quality.
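The claim is easy to verify with a batch-invariance check: run the identical prompts at batch size 1 and batch size 4 and diff the per-example predictions. A minimal, model-agnostic sketch; the `predict` callable is a hypothetical stand-in for whatever the harness actually calls:

```python
def batch_invariance_check(predict, prompts, sizes=(1, 4)):
    """Run the same prompts at each batch size and report mismatches.

    `predict(batch)` is a hypothetical stand-in for the harness's model
    call: it takes a list of prompts and returns one prediction per
    prompt, in order.
    """
    runs = {}
    for bs in sizes:
        preds = []
        for i in range(0, len(prompts), bs):
            preds.extend(predict(prompts[i:i + bs]))
        runs[bs] = preds
    base = runs[sizes[0]]
    # An empty list means predictions are batch-invariant.
    return [
        (i, {bs: runs[bs][i] for bs in sizes})
        for i in range(len(prompts))
        if any(runs[bs][i] != base[i] for bs in sizes)
    ]

# With a deterministic stub model, predictions must not depend on batch size:
stub = lambda batch: [p.upper() for p in batch]
assert batch_invariance_check(stub, ["a", "b", "c", "d", "e"]) == []
```

Any non-empty result on a greedy-decoded model is evidence of a harness bug, not a model property.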

// ANALYSIS

Hot take: a large batch-size swing on a static multiple-choice benchmark is almost always an implementation bug.

  • Causal LMs can break under batching if padding side, attention masks, or position ids are handled incorrectly.
  • Batched generation can mis-score if outputs are mapped back to examples by the wrong index after collation or sorting.
  • Prompt truncation, stopping criteria, or tokenization differences between single-item and multi-item batches can materially change exact-match metrics.
  • The fact that both utility_general and utility_reason fall off points to a shared eval-path problem, not a dataset-specific weakness.
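The padding-side failure above is easy to demonstrate without a real model. For a causal LM scored from the logits at the last position, right-padding puts PAD tokens at position -1, so naively indexing `row[-1]` reads a pad slot instead of the real final token; the fix is left-padding or indexing by true sequence length. A toy sketch with token lists standing in for tensors (PAD and the helper names are illustrative):

```python
PAD = 0

def pad_batch(seqs, side="right"):
    """Pad variable-length token lists to a rectangle, like a collator."""
    width = max(len(s) for s in seqs)
    if side == "right":
        return [s + [PAD] * (width - len(s)) for s in seqs]
    return [[PAD] * (width - len(s)) + s for s in seqs]

def last_token_naive(padded):
    # What a buggy harness does: always read position -1.
    return [row[-1] for row in padded]

def last_token_by_length(padded, lengths):
    # Correct under right-padding: index by each example's true length.
    return [row[n - 1] for row, n in zip(padded, lengths)]

seqs = [[5, 6, 7], [8, 9]]          # batch of two prompts, lengths 3 and 2
lengths = [len(s) for s in seqs]

right = pad_batch(seqs, side="right")
assert last_token_naive(right) == [7, PAD]        # second example reads padding!
assert last_token_by_length(right, lengths) == [7, 9]

left = pad_batch(seqs, side="left")
assert last_token_naive(left) == [7, 9]           # left-padding makes -1 safe
```

Note that at batch size 1 there is no padding at all, which is exactly why this class of bug shows up only when the batch size grows.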
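The index-alignment failure mode can be sketched too. Many harnesses sort a batch by length before padding (to minimize wasted compute) and must undo that permutation before scoring; forgetting the inverse permutation scores each answer against the wrong example. The helper names here are hypothetical:

```python
def run_sorted(prompts, generate):
    """Sort by length for efficient padding, generate, then restore order."""
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    outputs = generate([prompts[i] for i in order])
    restored = [None] * len(prompts)
    for pos, i in enumerate(order):
        restored[i] = outputs[pos]   # invert the permutation
    return restored

# Echo "model": the correct answer for each prompt is the prompt itself.
echo = lambda batch: list(batch)

prompts = ["bbb", "a", "cc"]
buggy = echo(sorted(prompts, key=len))       # sorted, never restored
assert buggy != prompts                      # mis-aligned: wrong answers scored
assert run_sorted(prompts, echo) == prompts  # restored order matches
```

A bug like this also depends on batch size: with batch size 1 every "batch" is trivially in order, so the mis-alignment only appears once batching reorders examples.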
// TAGS
rwku · llama-3.2 · benchmarking · evaluation · batching · llm-inference · unlearning

DISCOVERED

2h ago

2026-04-20

PUBLISHED

4h ago

2026-04-20

RELEVANCE

7/10

AUTHOR

SwimmingMedical6693