OPEN_SOURCE
REDDIT · 19d ago · OPEN-SOURCE RELEASE
fla-volta unlocks Gated DeltaNet on V100
InMecha's fla-volta backports native CUDA kernels for Flash Linear Attention's Gated DeltaNet path so it can run on NVIDIA Volta V100 GPUs, where the stock Triton kernels hang on sm_70. The repo is aimed at HuggingFace Transformers users and positions itself as a research-grade compatibility layer for Qwen3.5-class models, with the README showing a modest tok/s lift and a bigger hardware-compatibility win.
// ANALYSIS
This is a rare back-port that feels more like infrastructure preservation than product polish.
- Replaces two FLA components with handwritten CUDA kernels: a fused RMSNorm + SiLU gate and a fused recurrent Gated DeltaNet kernel adapted from llama.cpp
- README benchmarks show 16.8 tok/s on a V100 for Qwen3.5-2B versus 11.5 tok/s with the PyTorch fallback, though the authors note that HuggingFace generation overhead caps end-to-end gains
- The real value is keeping older V100 fleets useful for modern linear-attention models instead of waiting for upstream Triton support to catch up
- It is explicitly research-only, requires CUDA 12.x plus low-level GPU/CUDA skills, and the maintainers are not promising active support
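To ground what those two replaced components compute, here is a minimal pure-Python reference sketch, not the repo's CUDA kernels: the fused RMSNorm + SiLU gate as commonly used in FLA-style output normalization, and one step of a gated delta-rule state update in the form used by the Gated DeltaNet literature (state S maps keys to values; α is a decay gate, β a write strength). The exact parameterization in fla-volta is an assumption.

```python
import math

def rmsnorm_silu_gate(x, g, w, eps=1e-6):
    # RMSNorm(x) scaled by weight w, then gated elementwise by SiLU(g).
    # SiLU(g) = g * sigmoid(g). A real fused kernel does this in one pass
    # over the hidden dimension instead of three separate ops.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    y = [v / rms * wi for v, wi in zip(x, w)]
    return [yi * (gi / (1.0 + math.exp(-gi))) for yi, gi in zip(y, g)]

def gated_delta_step(S, q, k, v, alpha, beta):
    # One recurrent token step (assumed form):
    #   S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T
    #   o_t = S_t q
    # S is a d_v x d_k matrix (list of lists); q, k are d_k vectors, v is d_v.
    d_v, d_k = len(S), len(S[0])
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    S_new = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d_k)] for i in range(d_v)]
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o
```

The per-token sequential dependence of `gated_delta_step` is exactly why the recurrent path needs a dedicated kernel: each step reads the state the previous step wrote, so a naive PyTorch loop pays full launch overhead per token.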
// TAGS
fla-volta · gpu · inference · open-source · llm · self-hosted
DISCOVERED
2026-03-24 (19d ago)
PUBLISHED
2026-03-23 (19d ago)
RELEVANCE
8/10
AUTHOR
Sliouges