fla-volta unlocks Gated DeltaNet on V100
OPEN_SOURCE ↗
REDDIT // 19d ago · OPEN-SOURCE RELEASE


InMecha's fla-volta backports native CUDA kernels for Flash Linear Attention's Gated DeltaNet path so it can run on NVIDIA Volta V100 GPUs, where the stock Triton kernels hang on sm_70 (Volta's compute capability). The repo is aimed at HuggingFace Transformers users and positions itself as a research-grade compatibility layer for Qwen3.5-class models; the README shows a modest tok/s lift and a bigger hardware-compatibility win.
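For context on what a "fused recurrent Gated DeltaNet kernel" has to compute: the published Gated DeltaNet recurrence is S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. A minimal pure-Python sketch of one step (names, shapes, and the function itself are illustrative assumptions, not taken from the fla-volta repo):

```python
# Illustrative sketch of one Gated DeltaNet recurrent step (assumed from the
# published formulation, not from fla-volta's actual kernel):
#   S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
#   o_t = S_t q
def gated_deltanet_step(S, q, k, v, alpha, beta):
    """S is a d_v x d_k state matrix; q, k have length d_k; v has length d_v.
    alpha is the decay gate, beta the delta-rule write strength."""
    d_v, d_k = len(S), len(S[0])
    # Sk = S @ k, the state's current prediction for key k (length d_v)
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the state and overwrite the k-direction with the new value v
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out against the query
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o
```

A fused kernel evaluates this whole chain per timestep in registers/shared memory rather than launching separate matmul and elementwise ops, which is why the PyTorch fallback is slower.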

// ANALYSIS

This is a rare back-port that feels more like infrastructure preservation than product polish.

  • Replaces two FLA components with handwritten CUDA kernels, including a fused RMSNorm + SiLU gate and a fused recurrent Gated DeltaNet kernel adapted from llama.cpp
  • README benchmarks show 16.8 tok/s on a V100 for Qwen3.5-2B versus 11.5 tok/s with the PyTorch fallback, but the authors say HuggingFace generation overhead caps end-to-end gains
  • The real value is keeping older V100 fleets useful for modern linear-attention models instead of waiting for upstream Triton support to catch up
  • It is explicitly research-only, needs CUDA 12.x plus low-level GPU/CUDA skills, and the maintainers are not promising active support
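The "fused RMSNorm + SiLU gate" in the first bullet refers to collapsing normalization and gating into one pass. A sketch of the math such a kernel computes, assuming the standard FLA-style gated norm y_i = (x_i / rms(x)) · w_i · silu(g_i) (the function name and eps default are illustrative, not from the repo):

```python
import math

def fused_rmsnorm_silu_gate(x, gate, weight, eps=1e-6):
    """One-pass sketch of a fused RMSNorm + SiLU-gate (assumed formula):
        y_i = (x_i / rms(x)) * weight_i * silu(gate_i)
    where rms(x) = sqrt(mean(x^2) + eps) and silu(g) = g * sigmoid(g)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [
        (xi / rms) * wi * (gi / (1.0 + math.exp(-gi)))  # gi*sigmoid(gi) folded in
        for xi, wi, gi in zip(x, weight, gate)
    ]
```

Fusing these avoids materializing the normalized tensor and the SiLU activation as separate intermediates, which matters on V100's comparatively small memory bandwidth.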
// TAGS
fla-volta · gpu · inference · open-source · llm · self-hosted

DISCOVERED

19d ago · 2026-03-24

PUBLISHED

19d ago · 2026-03-23

RELEVANCE

8/10

AUTHOR

Sliouges