Hybrid attention speeds Rust code model 50x
A small Rust-focused language model, trained from scratch with hybrid local-plus-recurrent attention, reached 286 tokens per second on an RTX 4060 Ti, about 50x faster than the full-attention baseline (which implies a baseline of roughly 6 tokens per second). The main takeaway, though, is that scaling the Rust corpus from about 31 MB to 173 MB (roughly 5.6x) improved validation loss more than the architectural changes did.
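To make the "local-plus-recurrent" idea concrete, here is a toy sketch of why such a decoder can keep per-token work constant. It is not the post's implementation: the scalar keys and values, the `decay` update, the fixed `gate` mix, and the class name `HybridDecoderState` are all assumptions chosen for illustration. The only point is that the cache holds a fixed window plus one summary value, however many tokens have been generated.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class HybridDecoderState:
    """Toy per-layer decoding state (hypothetical, scalar-valued).

    Keeps the last `window` (key, value) pairs for local attention and
    folds every evicted value into a single recurrent summary.
    """

    def __init__(self, window, decay=0.9, gate=0.5):
        self.window = deque(maxlen=window)
        self.decay = decay    # assumed exponential-decay recurrence
        self.gate = gate      # assumed fixed mix of local vs recurrent
        self.summary = 0.0

    def step(self, q, k, v):
        # Before the oldest entry is evicted, fold it into the summary.
        if len(self.window) == self.window.maxlen:
            _, old_v = self.window[0]
            self.summary = self.decay * self.summary + (1 - self.decay) * old_v
        self.window.append((k, v))

        # Local attention over at most `window` cached entries.
        weights = softmax([q * key for key, _ in self.window])
        local = sum(w * val for w, (_, val) in zip(weights, self.window))
        return self.gate * local + (1 - self.gate) * self.summary

    def cache_size(self):
        # Window entries plus the one recurrent summary slot.
        return len(self.window) + 1


state = HybridDecoderState(window=4)
for _ in range(100):
    out = state.step(q=0.1, k=0.2, v=1.0)
print(state.cache_size())  # stays at window + 1, not 100
```

A full-attention decoder would hold all 100 entries at this point and attend over every one of them per step, which is where a large throughput gap on long generations can come from.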
Strong research signal, but this reads more like a systems and scaling note than a product launch.
- The clearest and most defensible result is that data scaling beat architecture tuning at this model size.
- The inference win is substantial, but it appears to come from the cache/compression strategy as much as from the attention formulation itself.
- Quality evidence is still thin: perplexity is good for a tiny model, but code usefulness should be judged with parsing, compilation, and completion benchmarks.
- The post would be stronger with ablations comparing hybrid, local-only, and recurrent-only attention, plus generation samples from earlier checkpoints.
- For a model this small, longer context and better tokenization are likely to matter as much as more exotic attention variants.
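The evaluation point above can be sketched as two small metrics: perplexity derived from validation loss, and a compile pass rate over generated samples. This is a minimal sketch under stated assumptions, not the author's harness: it assumes the loss is a mean cross-entropy in nats, that `rustc` is on the PATH, and that each sample is a standalone library-style source file. The function names are hypothetical.

```python
import math
import subprocess
import tempfile
from pathlib import Path


def perplexity(mean_loss_nats):
    """Perplexity from mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(mean_loss_nats)


def rustc_accepts(source):
    """Return True if rustc type-checks `source` as a library crate.

    `--emit=metadata` skips code generation, so this is a fast check
    rather than a full build. Requires rustc to be installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.rs"
        src.write_text(source)
        result = subprocess.run(
            ["rustc", "--edition", "2021", "--crate-type", "lib",
             "--emit=metadata", "--out-dir", tmp, str(src)],
            capture_output=True,
        )
        return result.returncode == 0


def pass_rate(samples, check=rustc_accepts):
    """Fraction of samples accepted by `check`; 0.0 for an empty list."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if check(s)) / len(samples)
```

Passing a different `check` function (a parser, a test runner) reuses the same pass-rate logic for the parsing and completion benchmarks; a compile pass rate alone still says nothing about whether the code does what was asked.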
Discovered: 2026-04-07
Published: 2026-04-07
Author: Inevitable_Back3319