Why Tokens Are Enough formalizes tokenizer entropy gap
OPEN_SOURCE
REDDIT · 26d ago · NEWS

Doug’s write-up argues that lossless tokenization is theoretically neutral for language modeling: any distribution P over strings can be induced by a distribution Q over token sequences, and the canonical construction preserves entropy exactly (H(Q) = H(P)). The practical twist is that real models still leak small probability mass onto non-canonical tokenizations, and controlled noise such as BPE-Dropout can improve generalization anyway.
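The entropy-preservation claim can be seen in miniature with a toy example (the distribution, vocabulary, and tokenizer below are illustrative assumptions, not taken from the post): when the canonical tokenizer maps each string to a unique token sequence, pushing the string distribution P forward yields a token-sequence distribution Q with exactly the same entropy.

```python
import math

# Hypothetical string distribution P (assumed for illustration).
P = {"aa": 0.5, "ab": 0.3, "b": 0.2}

# Toy vocabulary, longest-match-first.
VOCAB = ["aa", "ab", "a", "b"]

def canonical_tokenize(s):
    """Deterministic greedy tokenizer: one canonical split per string."""
    toks, i = [], 0
    while i < len(s):
        for v in VOCAB:
            if s.startswith(v, i):
                toks.append(v)
                i += len(v)
                break
    return tuple(toks)

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Induced token-sequence distribution Q: push P through the tokenizer.
Q = {}
for s, p in P.items():
    seq = canonical_tokenize(s)
    Q[seq] = Q.get(seq, 0.0) + p

# The canonical map is injective here, so no probability mass collapses
# and the entropies agree exactly.
assert abs(entropy_bits(P) - entropy_bits(Q)) < 1e-12
```

The equality holds because an injective tokenizer is just a relabeling of outcomes; entropy only drops when two strings collide onto the same token sequence, which a lossless scheme rules out.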

// ANALYSIS

The formal result may feel tautological to experts, but it is still useful because it cleanly separates representational limits from optimization behavior.

  • The post gives a compact proof scaffold people can cite when debating whether tokenization “loses” expressiveness.
  • It reframes non-canonical mass as measurable overhead (entropy gap / marginalization gap), not just intuition.
  • The Chirkova-style finding (small average gap, larger on harder text) explains why this matters most in edge cases, not clean benchmarks.
  • BPE-Dropout is the interesting contradiction: adding controlled tokenization noise can act like data augmentation even when canonical tokenization is information-theoretically optimal.
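The marginalization gap in the bullets above can be made concrete with toy numbers (the probabilities are illustrative assumptions, not figures from the post or the Chirkova-style results): if a model leaks some mass onto a non-canonical split of a string, scoring only the canonical split undercounts the string's true probability by a measurable number of bits.

```python
import math

# Hypothetical model probabilities for two tokenizations of the string
# "aa" (illustrative numbers, not from the post).
seq_probs = {
    ("aa",): 0.090,     # canonical tokenization
    ("a", "a"): 0.010,  # non-canonical leakage
}

canonical_prob = seq_probs[("aa",)]
marginal_prob = sum(seq_probs.values())  # true p("aa"): sum over all splits

# Marginalization gap: extra bits "paid" by evaluating only the
# canonical tokenization instead of marginalizing over all of them.
gap_bits = math.log2(marginal_prob) - math.log2(canonical_prob)
print(f"gap = {gap_bits:.3f} bits")  # → gap = 0.152 bits
```

With 10% of the mass on the non-canonical split the gap is about 0.15 bits for this one string; averaged over typical text it would be much smaller, which matches the "small on average, larger on harder text" pattern described above.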
// TAGS
why-tokens-are-enough · llm · research

DISCOVERED

2026-03-17

PUBLISHED

2026-03-16

RELEVANCE

7/10

AUTHOR

36845277