RWKU batch-size swing flags eval bug

// 45d ago | BENCHMARK RESULT

A Reddit post reports that Llama 3.2 1B Instruct scores about 47.3 on RWKU utility_general at batch size 1 but drops to 29.7 at batch size 4, with a similar collapse on utility_reason in a 3-shot setup. Benchmark accuracy should not change materially with batch size alone, so the swing points to a batching, padding, masking, truncation, or result-alignment issue in the evaluation harness rather than a real change in model quality.
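
One way to narrow down a swing like this is to compare per-example predictions across batch sizes instead of comparing aggregate scores. The sketch below is a minimal illustration, not the RWKU harness: `predict_batch` is a hypothetical callable standing in for whatever function turns a list of prompts into one answer per prompt, and the check assumes deterministic (greedy) decoding so that any difference comes from batching alone.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical interface (an assumption, not an RWKU API):
# takes a batch of prompts and returns one answer string per prompt.
PredictBatch = Callable[[Sequence[str]], List[str]]


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_all(prompts: Sequence[str], predict_batch: PredictBatch,
                batch_size: int) -> List[str]:
    """Run the predictor over all prompts, keeping dataset order."""
    preds: List[str] = []
    for batch in chunks(prompts, batch_size):
        out = predict_batch(batch)
        if len(out) != len(batch):
            raise ValueError("predictor returned a different number of outputs than inputs")
        preds.extend(out)
    return preds


def batch_mismatches(prompts: Sequence[str], predict_batch: PredictBatch,
                     batch_size: int) -> List[int]:
    """Return dataset indices whose prediction differs from the batch-size-1 run."""
    reference = predict_all(prompts, predict_batch, 1)
    batched = predict_all(prompts, predict_batch, batch_size)
    return [i for i, (a, b) in enumerate(zip(reference, batched)) if a != b]
```

An empty result means the predictor is batch-invariant on those prompts. A non-empty result gives concrete examples to inspect for padding, truncation, or ordering differences.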

// ANALYSIS

Hot take: a large batch-size swing on a static multiple-choice benchmark is almost always an implementation bug.

  • Causal LMs can break under batching if padding side, attention masks, or position ids are handled incorrectly.
  • Batched generation can mis-score if outputs are mapped back to examples by the wrong index after collation or sorting.
  • Prompt truncation, stopping criteria, or tokenization differences between single-item and multi-item batches can shift exact-match metrics substantially.
  • The fact that both utility_general and utility_reason fall off points to a shared eval-path problem, not a dataset-specific weakness.
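
Two of the failure modes listed above can be illustrated in plain Python without a model. This is a sketch under stated assumptions, not the code from the post: all function names are hypothetical, the attention mask is a simple list of 0/1 values for one left-padded sequence, and padded slots get a placeholder position of 0 on the assumption that the mask hides them from attention.

```python
from typing import List, Sequence


def position_ids_from_mask(attention_mask: Sequence[int]) -> List[int]:
    """Derive position ids so real tokens count from 0 regardless of padding.

    With left padding, a naive 0..n-1 range would shift every real token's
    position by the pad length, unlike the same prompt run at batch size 1.
    """
    position_ids: List[int] = []
    seen = 0  # number of real tokens encountered so far
    for bit in attention_mask:
        if bit:
            position_ids.append(seen)
            seen += 1
        else:
            position_ids.append(0)  # placeholder for a masked pad slot
    return position_ids


def length_sorted_order(prompts: Sequence[str]) -> List[int]:
    """Original indices of the prompts, ordered by prompt length (stable)."""
    return sorted(range(len(prompts)), key=lambda i: len(prompts[i]))


def restore_order(sorted_outputs: Sequence[str], order: Sequence[int]) -> List[str]:
    """Map outputs produced in length-sorted order back to dataset order."""
    if len(sorted_outputs) != len(order):
        raise ValueError("outputs and order must have the same length")
    restored = [""] * len(order)
    for rank, original_index in enumerate(order):
        restored[original_index] = sorted_outputs[rank]
    return restored
```

For a mask of `[0, 0, 1, 1, 1]` the real tokens receive positions 0, 1, 2, matching what the same prompt would get unpadded. Skipping `restore_order` after a length sort scores each output against the wrong reference answer, which can lower accuracy even though the model's outputs are unchanged.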
// TAGS
rwku, llama-3.2, benchmarking, evaluation, batching, llm-inference, unlearning

DISCOVERED: 2026-04-20 (45d ago)

PUBLISHED: 2026-04-20 (45d ago)

RELEVANCE: 7/10

AUTHOR: SwimmingMedical6693