"Why Tokens Are Enough" formalizes the tokenizer entropy gap
Doug’s write-up argues that lossless tokenization is theoretically neutral for language modeling: any string distribution can be induced from token distributions, and the canonical construction preserves entropy exactly (H(Q)=H(P)). The practical twist is that real models still leak small probability mass onto non-canonical tokenizations, and controlled noise like BPE-Dropout can improve generalization anyway.
The formal result may feel tautological to experts, but it is still useful because it cleanly separates representational limits from optimization behavior.
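The canonical construction can be checked on a toy case. The sketch below is an illustration, not code from the write-up: the distribution `P`, the `CANONICAL` tokenizer table, and the 0.05 of leaked mass are all hypothetical values. It assumes a tokenizer that is deterministic and decodes by concatenation.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def string_marginal(token_dist):
    """Sum token-sequence mass over all tokenizations that decode to one string."""
    out = {}
    for tokens, p in token_dist.items():
        text = "".join(tokens)
        out[text] = out.get(text, 0.0) + p
    return out

# Hypothetical string distribution P (illustrative values only).
P = {"ab": 0.5, "abc": 0.25, "c": 0.25}

# Hypothetical canonical tokenizer: one token sequence per string.
CANONICAL = {"ab": ("ab",), "abc": ("ab", "c"), "c": ("c",)}

# Canonical construction: Q places each string's mass on its canonical tokenization.
Q = {CANONICAL[text]: p for text, p in P.items()}

# A "leaky" token distribution: same string marginal, but some mass for "ab"
# sits on the non-canonical tokenization ("a", "b").
Q_leaky = dict(Q)
Q_leaky[("ab",)] = 0.45
Q_leaky[("a", "b")] = 0.05

h_p = entropy(P)
h_q = entropy(Q)
h_leaky = entropy(Q_leaky)

print(f"H(P) = {h_p:.4f} bits, H(Q) = {h_q:.4f} bits")
print(f"H(Q_leaky) - H(P) = {h_leaky - h_p:.4f} bits")
```

Because the canonical map is one-to-one, `Q` is just a relabeling of `P` and the two entropies match. The leaky variant still induces the same strings, yet its token-level entropy is higher, which is the overhead the summary calls the entropy gap.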
- The post gives a compact proof scaffold people can cite when debating whether tokenization “loses” expressiveness.
- It reframes non-canonical mass as measurable overhead (entropy gap / marginalization gap), not just intuition.
- The Chirkova-style finding (small average gap, larger on harder text) explains why this matters most in edge cases, not clean benchmarks.
- BPE-Dropout is the interesting contradiction: adding controlled tokenization noise can act like data augmentation even when canonical tokenization is information-theoretically optimal.
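Two of the points above lend themselves to a small sketch: how BPE-Dropout produces non-canonical segmentations, and how a marginalization gap could be measured for one string. Everything here is a simplified assumption for illustration: the `MERGES` table, the `token_probs` values, and the helper names are hypothetical, and the encoder is a toy version of BPE rather than any library's implementation.

```python
import math
import random

# Hypothetical merge table in priority order (not from a real vocabulary).
MERGES = [("a", "b"), ("ab", "c")]

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Toy BPE: each applicable merge is skipped with probability `dropout`."""
    rng = rng or random.Random()
    tokens = list(word)
    while True:
        # Collect the merges that survive dropout in this pass.
        candidates = []
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and rng.random() >= dropout:
                candidates.append((merges.index(pair), i))
        if not candidates:
            return tuple(tokens)
        # Apply the highest-priority surviving merge, leftmost on ties.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

def marginalization_gap(token_probs, canonical):
    """Bits lost by scoring only the canonical tokenization, not the marginal."""
    text = "".join(canonical)
    marginal = sum(p for toks, p in token_probs.items() if "".join(toks) == text)
    return math.log2(marginal / token_probs[canonical])

# dropout=0.0 gives the canonical segmentation; dropout=1.0 falls back to characters.
canonical = bpe_encode("abc", MERGES)
characters = bpe_encode("abc", MERGES, dropout=1.0)

# Sampling with dropout yields several segmentations of the same string.
rng = random.Random(0)
samples = {bpe_encode("abc", MERGES, dropout=0.5, rng=rng) for _ in range(200)}

# Hypothetical model probabilities over token sequences (illustrative values).
token_probs = {
    ("abc",): 0.45,
    ("ab", "c"): 0.04,
    ("a", "b", "c"): 0.01,
    ("c",): 0.50,
}
gap = marginalization_gap(token_probs, canonical)
print(f"canonical = {canonical}, gap = {gap:.4f} bits")
```

The gap is zero only when all of a string's mass sits on its canonical tokenization, so it can never be negative. In this toy setup the dropout samples are exactly the kind of non-canonical sequences that carry the leaked mass.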
DISCOVERED: 2026-03-17
PUBLISHED: 2026-03-16
AUTHOR: 36845277