OPEN_SOURCE
REDDIT · 2h ago · BENCHMARK RESULT
RWKU batch-size swing flags eval bug
A Reddit post reports that Llama 3.2 1B Instruct scores about 47.3 on RWKU utility_general at batch size 1 but drops to 29.7 at batch size 4, with a similar collapse on utility_reason in a 3-shot setup. Because accuracy on a static benchmark should not depend materially on batch size, the swing points to a batching, padding, masking, truncation, or result-alignment bug in the evaluation harness rather than a genuine model-quality problem.
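One concrete way batching silently corrupts causal-LM scores is padding. With left padding (the usual setup for batched generation), position ids must be derived from the attention mask rather than from raw token indices, or every padded sequence is evaluated at shifted positions. A minimal plain-Python sketch of the standard fix (the function name is illustrative, not from any specific harness):

```python
def position_ids_from_mask(attention_mask):
    """Compute per-token position ids from a 0/1 attention mask.

    Real tokens get positions 0, 1, 2, ... counted over non-pad tokens only;
    pad positions get a dummy id of 0 (they are masked out of attention anyway).
    Equivalent to the common tensor idiom `mask.cumsum(-1) - 1`, clamped at 0.
    """
    ids = []
    for row in attention_mask:
        running = 0
        row_ids = []
        for m in row:
            if m:
                row_ids.append(running)
                running += 1
            else:
                row_ids.append(0)  # pad slot: dummy position
        ids.append(row_ids)
    return ids


# A left-padded batch: sequence 1 has two pad tokens, sequence 2 has none.
mask = [[0, 0, 1, 1],
        [1, 1, 1, 1]]
print(position_ids_from_mask(mask))  # [[0, 0, 0, 1], [0, 1, 2, 3]]
```

At batch size 1 there is no padding, so a harness that skips this step still scores correctly, which matches the reported pattern of a collapse only at larger batch sizes.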
// ANALYSIS
Hot take: a large batch-size swing on a static multiple-choice benchmark is almost always an implementation bug.
- Causal LMs can break under batching if padding side, attention masks, or position ids are handled incorrectly.
- Batched generation can mis-score if outputs are mapped back to examples by the wrong index after collation or sorting.
- Prompt truncation, stopping criteria, or tokenization differences between single-item and multi-item batches can significantly change exact-match metrics.
- The fact that both utility_general and utility_reason fall off points to a shared eval-path problem, not a dataset-specific weakness.
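The second failure mode above, wrong index mapping after sorting, is easy to reproduce. Many harnesses sort prompts by length to minimize padding per batch; scores must then be written back to the original example order before comparison with gold answers. A hypothetical sketch (names and structure are illustrative, not the RWKU eval code):

```python
def score_in_batches(prompts, gold, batch_size, predict_fn):
    """Batch prompts sorted by length, then restore original order.

    Forgetting the write-back step below is exactly the kind of bug that
    leaves batch_size=1 accuracy intact while larger batches collapse.
    """
    # Sort indices by prompt length so each batch needs minimal padding.
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    preds = [None] * len(prompts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch_preds = predict_fn([prompts[i] for i in idx])
        # Crucial step: map each prediction back to its ORIGINAL position.
        for i, p in zip(idx, batch_preds):
            preds[i] = p
    return [p == g for p, g in zip(preds, gold)]


# Toy "model": predict the first character of each prompt.
prompts = ["aa", "b", "cccc", "dd"]
gold = ["a", "b", "c", "d"]
fn = lambda batch: [p[0] for p in batch]
print(all(score_in_batches(prompts, gold, 1, fn)))  # True
print(all(score_in_batches(prompts, gold, 4, fn)))  # True
```

Dropping the write-back loop and zipping `batch_preds` against the sorted prompts' positions in submission order would make batch size 1 trivially correct and batch size 4 wrong, reproducing the reported asymmetry.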
// TAGS
rwku · llama-3.2 · benchmarking · evaluation · batching · llm-inference · unlearning
DISCOVERED
2h ago
2026-04-20
PUBLISHED
4h ago
2026-04-20
RELEVANCE
7/10
AUTHOR
SwimmingMedical6693