OPEN_SOURCE
REDDIT · 12d ago · BENCHMARK RESULT
Depth Pruning Transfers Across GPT-2, TinyLlama
This Reddit post reports that removing the least sensitive transformer layers, then recovering with distillation, shrinks GPT-2 and TinyLlama with only a modest perplexity loss. The surprising part is that the same pruning recipe appears to transfer cleanly from a 124M GPT-2 setup to TinyLlama 1.1B.
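The selection step of a recipe like this can be sketched in a few lines. The snippet below is illustrative only, not the poster's code: it assumes each layer is scored by the perplexity hit from skipping it, and the least sensitive layers are dropped before the distillation-based recovery pass. All numbers and names are made up for the example.

```python
# Hypothetical depth-pruning selection: score each layer by the perplexity
# increase when it is skipped, then drop the k least sensitive layers.
def select_layers_to_prune(skip_ppl, base_ppl, k):
    """Return indices of the k layers whose removal hurts perplexity least.

    skip_ppl[i] is the model's perplexity with layer i skipped;
    base_ppl is the unpruned model's perplexity.
    """
    sensitivity = [(ppl - base_ppl, i) for i, ppl in enumerate(skip_ppl)]
    sensitivity.sort()  # least damaging layers first
    return sorted(i for _, i in sensitivity[:k])

# Toy 12-layer example: middle layers tend to be cheapest to remove.
base = 20.0
skip = [35.0, 28.0, 22.5, 21.0, 20.8, 20.6,
        20.7, 21.2, 22.0, 24.0, 27.0, 33.0]
print(select_layers_to_prune(skip, base, 2))  # → [5, 6]
```

Note the toy sensitivity profile is U-shaped, which is exactly the pattern the analysis below flags: early and late layers are expensive to remove, middle layers are not.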
// ANALYSIS
This looks less like a new compression miracle than a strong sign that depth is more redundant than width in these models, especially in the middle layers. The interesting bit is the transferability: the same simple pruning heuristic seems to survive a jump across architectures and scale.
- GPT-2 saw roughly 11-17% parameter reduction with about 9-13% PPL degradation and a reported 1.2x decode speedup
- TinyLlama 1.1B held up reasonably well when pruned from 22 layers to 20 and to 19, with the 20-layer variant giving the cleaner speed win and only a small bump in PPL ratio
- Stability across three seeds suggests the recovery stage is doing real work rather than riding noise
- Early and late layers still look disproportionately important, which matches common pruning intuition and makes the "best" layer pair a moving target after recovery
- This is promising for model-compression workflows, but it is still a small-scale result and needs broader benchmark coverage before anyone treats it as a general rule
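As a sanity check on the headline numbers, the parameter-reduction range is consistent with simply dropping one to three transformer blocks. The figures below are a back-of-envelope assumption (roughly 38M embedding parameters plus twelve ~7.1M-parameter layers for GPT-2 small), not data from the post.

```python
# Rough parameter-reduction estimate for depth pruning, assuming a model
# with `n_layers` identical transformer blocks of `per_layer` params each
# plus `other` params (embeddings etc.). Figures are illustrative.
def param_reduction(n_layers, per_layer, other, pruned):
    total = n_layers * per_layer + other
    return pruned * per_layer / total

# Dropping 2 of GPT-2 small's 12 layers under these assumptions removes
# about 11.5% of all parameters; dropping 3 removes about 17%, which
# brackets the reported 11-17% range.
print(f"{param_reduction(12, 7.1e6, 38e6, 2):.1%}")  # → 11.5%
print(f"{param_reduction(12, 7.1e6, 38e6, 3):.1%}")  # → 17.3%
```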
// TAGS
llm · benchmark · research · depth-first-pruning · gpt-2 · tinyllama
DISCOVERED
2026-03-31
PUBLISHED
2026-03-31
RELEVANCE
8/10
AUTHOR
califalcon