Why Tokens Are Enough formalizes tokenizer entropy gap

Doug’s write-up argues that lossless tokenization is theoretically neutral for language modeling: any distribution P over strings can be induced by a distribution Q over token sequences, and the canonical construction, which places each string's probability on its canonical tokenization, preserves entropy exactly (H(Q) = H(P)). The practical twist is that real models still leak a small amount of probability mass onto non-canonical tokenizations, and controlled noise such as BPE-Dropout can improve generalization anyway.
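The entropy claim can be checked on a toy case. The sketch below is illustrative only and is not taken from the write-up: the string distribution `P`, the vocabulary `VOCAB`, and the greedy longest-match rule standing in for a canonical tokenizer are all assumptions made for the example.

```python
import math

# Hypothetical toy distribution P over strings (assumption, not from the post).
P = {"ab": 0.5, "abc": 0.25, "c": 0.25}

# Hypothetical vocabulary; greedy longest-match stands in for the canonical tokenizer.
VOCAB = ["ab", "a", "b", "c"]


def canonical(s):
    """Return the greedy longest-match tokenization of s as a tuple of tokens."""
    tokens, i = [], 0
    while i < len(s):
        match = max((t for t in VOCAB if s.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tuple(tokens)


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Canonical construction: Q puts each string's full mass on its canonical tokenization.
Q = {canonical(s): p for s, p in P.items()}

# Detokenizing Q recovers P, because tokenization is lossless.
P_induced = {"".join(tokens): p for tokens, p in Q.items()}

print(entropy(P), entropy(Q))
```

Because the canonical map is one-to-one on strings, Q is a relabeling of P, so the two entropies match and detokenizing Q gives back P.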

// ANALYSIS

The formal result may feel tautological to experts, but it is still useful because it cleanly separates representational limits from optimization behavior.

– The post gives a compact proof scaffold that people can cite when debating whether tokenization “loses” expressiveness.
– It reframes non-canonical mass as measurable overhead (an entropy gap or marginalization gap) rather than an intuition.
– The Chirkova-style finding (a small average gap that grows on harder text) explains why this matters most in edge cases, not on clean benchmarks.
– BPE-Dropout is the interesting contradiction: adding controlled tokenization noise can act like data augmentation even when canonical tokenization is information-theoretically optimal.
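The two quantities in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the post's code: `Q_MODEL` is a made-up token-sequence distribution that leaks mass onto non-canonical tokenizations, `CANONICAL` is a hypothetical canonical mapping, and `bpe_encode` is a simplified BPE with a dropout option (each candidate merge is skipped with probability `dropout` at every step).

```python
import math
import random
from collections import defaultdict

# Hypothetical model distribution over token sequences (sums to 1).
Q_MODEL = {
    ("ab",): 0.48,
    ("a", "b"): 0.02,        # non-canonical tokenization of "ab"
    ("ab", "c"): 0.24,
    ("a", "b", "c"): 0.01,   # non-canonical tokenization of "abc"
    ("c",): 0.25,
}

# Hypothetical canonical tokenization of each string.
CANONICAL = {"ab": ("ab",), "abc": ("ab", "c"), "c": ("c",)}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginalize(q):
    """Sum token-sequence probabilities over all tokenizations of each string."""
    out = defaultdict(float)
    for tokens, p in q.items():
        out["".join(tokens)] += p
    return dict(out)


def marginalization_gap_bits(q, canonical):
    """Per-string gap: log2 of marginal probability over canonical-only probability."""
    m = marginalize(q)
    return {s: math.log2(m[s] / q[toks]) for s, toks in canonical.items()}


def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Simplified BPE; with dropout > 0 each candidate merge may be skipped."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


gaps = marginalization_gap_bits(Q_MODEL, CANONICAL)
entropy_gap = entropy(Q_MODEL) - entropy(marginalize(Q_MODEL))
print(gaps, entropy_gap)
print(bpe_encode("abc", [("a", "b"), ("ab", "c")], dropout=0.5))
```

With `dropout=0.0` the encoder is deterministic and returns the canonical segmentation; with a positive rate it samples alternative segmentations of the same string, which is the sense in which the noise resembles data augmentation.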

// TAGS

why-tokens-are-enough, llm, research

DISCOVERED: 2026-03-17 (117d ago)

PUBLISHED: 2026-03-16 (118d ago)

RELEVANCE: 7/10

AUTHOR: 36845277
