SATFormer beats static early-reuse baselines
SATFormer is a Transformer variant, introduced in a new architecture paper, that keeps the cheap first-layer value pathway from ResFormer but adds a per-token, per-head gate that controls when later layers reuse early representations. Across models from 130M to 1.3B parameters, it improves validation loss and scores on retrieval-heavy benchmarks while staying close to baseline Transformer throughput.
The interesting move here is not “more connectivity” but “better control.” If early features are useful only some of the time, a selective gate is a cleaner scaling story than spraying dense cross-layer pathways everywhere.
- On retrieval-intensive tasks, SATFormer improves on ResFormer by about 1.5 points on average and narrowly edges out MUDDFormer, which is the right kind of gain for this line of research.
- The efficiency story matters: it stays close to Transformer/ResFormer throughput and memory, while dense alternatives pay a real wall-clock tax.
- The gate behavior looks structured, not decorative: access to early representations is sparse, depth-dependent, head-specific, and sensitive to token and category, which suggests the model is learning when those representations actually matter.
- This is a useful design lesson for Transformer variants: the next step may be routing policies over existing signals, not just adding more residual highways.
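To make the "per-token, per-head gate" idea concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the summary above does not give SATFormer's gate equation, so the sigmoid gate computed from the token's hidden state, the additive mixing rule, and every name (`gated_value_reuse`, `gate_w`, `gate_b`) are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_value_reuse(x, v_layer, v_first, gate_w, gate_b):
    """Mix cached first-layer values into a later layer, per token and per head.

    ASSUMED form (not taken from the paper):
        g[t][h] = sigmoid(gate_w[h] . x[t] + gate_b[h])
        v[t][h] = v_layer[t][h] + g[t][h] * v_first[t][h]

    x:       [tokens][d_model]        hidden states entering the later layer
    v_layer: [tokens][heads][d_head]  values computed by the later layer
    v_first: [tokens][heads][d_head]  values cached from the first layer
    gate_w:  [heads][d_model]         hypothetical gate weights, one row per head
    gate_b:  [heads]                  hypothetical gate biases
    Returns (mixed, gates); gates is [tokens][heads] with entries in [0, 1].
    """
    mixed, gates = [], []
    for t, x_t in enumerate(x):
        row_values, row_gates = [], []
        for h, w_h in enumerate(gate_w):
            # One scalar gate for this (token, head) pair.
            logit = sum(w * xi for w, xi in zip(w_h, x_t)) + gate_b[h]
            g = sigmoid(logit)
            row_gates.append(g)
            # A gate near 0 ignores the early values; near 1 reuses them fully.
            row_values.append(
                [vl + g * vf for vl, vf in zip(v_layer[t][h], v_first[t][h])]
            )
        mixed.append(row_values)
        gates.append(row_gates)
    return mixed, gates


# Two tokens, two heads, d_model = 2, d_head = 1.
x = [[1.0, 2.0], [0.5, -1.0]]
v_layer = [[[1.0], [1.0]], [[2.0], [2.0]]]
v_first = [[[10.0], [10.0]], [[20.0], [20.0]]]
gate_w = [[0.0, 0.0], [0.0, 0.0]]  # zero weights: the bias alone sets the gate
gate_b = [-50.0, 50.0]             # head 0 effectively closed, head 1 open
mixed, gates = gated_value_reuse(x, v_layer, v_first, gate_w, gate_b)
```

In this toy setup head 0 keeps its own values while head 1 adds the first-layer values in full. With a single fixed mixing weight shared by all tokens and heads, the same code would behave like a static early-reuse baseline, which is the contrast the summary draws.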
DISCOVERED: 2026-05-06
PUBLISHED: 2026-05-06
AUTHOR: Skye7821