OPEN_SOURCE
REDDIT // NEWS · 26d ago
“Why Tokens Are Enough” formalizes the tokenizer entropy gap
Doug’s write-up argues that lossless tokenization is theoretically neutral for language modeling: any string distribution can be induced from a token distribution, and the canonical construction preserves entropy exactly (H(Q) = H(P)). The practical twist is that real models still leak small probability mass onto non-canonical tokenizations, and that controlled noise such as BPE-Dropout can improve generalization anyway.
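As a toy illustration of the H(Q) = H(P) claim (not the post’s actual construction; the distribution and vocab below are hypothetical): when the tokenizer is deterministic and injective, each string maps to exactly one token sequence, so probability mass transfers one-to-one and entropy is preserved.

```python
import math

# Hypothetical string distribution P and a tiny vocab (longest-match-first).
P = {"aab": 0.5, "abb": 0.25, "bba": 0.25}
VOCAB = ["ab", "bb", "a", "b"]

def tokenize(s):
    """Greedy longest-match tokenization: one canonical segmentation per string."""
    toks, i = [], 0
    while i < len(s):
        for t in VOCAB:
            if s.startswith(t, i):
                toks.append(t)
                i += len(t)
                break
    return tuple(toks)

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Induced token-sequence distribution Q: injective map, so mass moves one-to-one.
Q = {}
for s, p in P.items():
    seq = tokenize(s)
    Q[seq] = Q.get(seq, 0.0) + p

print(entropy(P), entropy(Q))  # 1.5 1.5 -- H(Q) = H(P)
```

The entropy gap the post measures is the deviation from this ideal: mass a real model spreads over non-canonical segmentations of the same string.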
// ANALYSIS
The formal result may feel tautological to experts, but it is still useful because it cleanly separates representational limits from optimization behavior.
- The post gives a compact proof scaffold people can cite when debating whether tokenization “loses” expressiveness.
- It reframes non-canonical mass as measurable overhead (entropy gap / marginalization gap), not just intuition.
- The Chirkova-style finding (small average gap, larger on harder text) explains why this matters most in edge cases, not clean benchmarks.
- BPE-Dropout is the interesting contradiction: adding controlled tokenization noise can act like data augmentation even when canonical tokenization is information-theoretically optimal.
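The BPE-Dropout mechanism in the last bullet can be sketched in a few lines (a minimal sketch with an assumed three-rule merge table, not a trained tokenizer): during encoding, each applicable merge site is randomly skipped with probability `dropout`, producing noisy, non-canonical segmentations of the same word.

```python
import random

# Hypothetical merge table: lower rank = higher merge priority.
RANK = {("a", "b"): 0, ("ab", "b"): 1, ("b", "b"): 2}

def bpe_encode(word, dropout=0.0, rng=random):
    """Encode `word` with BPE, skipping each candidate merge site
    with probability `dropout` (dropout=0.0 gives the canonical output)."""
    toks = list(word)
    while True:
        # Collect applicable merges, randomly dropping each candidate site.
        candidates = [
            (RANK[(toks[i], toks[i + 1])], i)
            for i in range(len(toks) - 1)
            if (toks[i], toks[i + 1]) in RANK and rng.random() >= dropout
        ]
        if not candidates:
            return toks
        _, i = min(candidates)  # apply the best-ranked surviving merge
        toks[i:i + 2] = [toks[i] + toks[i + 1]]

print(bpe_encode("abb", dropout=0.0))  # ['abb'] (canonical)
print(bpe_encode("abb", dropout=1.0))  # ['a', 'b', 'b'] (all merges dropped)
```

Intermediate dropout values (e.g. 0.1) yield a mix of segmentations across epochs, which is what lets the technique act like data augmentation over tokenizations.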
// TAGS
why-tokens-are-enough · llm · research
DISCOVERED
2026-03-17 (26d ago)
PUBLISHED
2026-03-16 (26d ago)
RELEVANCE
7/10
AUTHOR
36845277