OPEN_SOURCE
REDDIT · OPEN-SOURCE RELEASE
ComfyUI-FeatherOps shows fp8 gains on RDNA3
FeatherOps adds a HIP kernel for fp16@fp8e5m2 matmul to ComfyUI, showing that RDNA3/3.5 GPUs can still see meaningful fp8-era speedups without native fp8 instructions. The author calls it a proof of concept, but notes the same kernel may also help LLM training and batch-1 decoding.
// ANALYSIS
This is a smart bandwidth-and-occupancy play, not a magic precision trick. The interesting part is that AMD users can squeeze more out of existing matrix hardware by changing where the data lives and how long it stays there.
- The kernel keeps fp8 weights in LDS and upcasts late, which trims VRAM-to-LDS traffic and reduces instruction pressure in the K-loop.
- The repo reports 52 TFLOPS from C++ and 43 TFLOPS from Python on Strix Halo, versus 30 TFLOPS for torch fp16, so the path can beat the default ROCm kernels when launch overhead is controlled.
- The author is explicit that larger diffusion-sized matrices dilute the win, so the approach looks most compelling for smaller linear layers and batch-1, decode-like workloads.
- It is handwritten HIP/asm rather than Triton or Tensile output, which leaves headroom for deeper tuning but also means more sensitivity to ROCm and driver churn.
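One reason the late upcast is so cheap: fp8 e5m2 shares fp16's 5-bit exponent, so an e5m2 byte is literally the top byte of an fp16 value, and widening is a bit shift rather than arithmetic. A minimal NumPy sketch of that idea (illustrative only, not the repo's HIP kernel; function names are ours):

```python
import numpy as np

def upcast_e5m2_to_fp16(raw: np.ndarray) -> np.ndarray:
    """Widen fp8 e5m2 bytes to fp16. e5m2 is fp16 truncated to its top
    byte (same 5 exponent bits, 2 of 10 mantissa bits kept), so the
    upcast is a left shift by 8 and a reinterpret -- no arithmetic."""
    return (raw.astype(np.uint16) << 8).view(np.float16)

def quantize_fp16_to_e5m2(x: np.ndarray) -> np.ndarray:
    """Truncate fp16 to e5m2 by keeping the top byte (round-toward-zero
    on the mantissa; a real kernel would round to nearest)."""
    return (x.astype(np.float16).view(np.uint16) >> 8).astype(np.uint8)

# fp16 activations @ fp8-stored weights: the weights live in memory as
# one byte each and are widened just before the multiply.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float16)
w8 = quantize_fp16_to_e5m2(rng.standard_normal((8, 4)))
y = a @ upcast_e5m2_to_fp16(w8)
```

Because the shift is exact for normals, denormals, infinities, and NaNs, every representable e5m2 value round-trips through fp16 losslessly; the only information loss happens once, at quantization time.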
// TAGS
gpu · inference · llm · open-source · comfyui-featherops
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
woct0rdho