ComfyUI-FeatherOps shows fp8 gains on RDNA3
REDDIT · 21d ago · OPEN-SOURCE RELEASE

FeatherOps adds a HIP kernel for fp16@fp8e5m2 matmul (fp16 activations against fp8 e5m2 weights) to ComfyUI and shows that RDNA3/3.5 GPUs can still get meaningful fp8-era speedups without native fp8 instructions. The author calls it a proof of concept but suggests the same kernel may also help LLM training and batch-1 decoding.
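As a rough illustration of what an fp16@fp8e5m2 matmul does, here is a CPU-side sketch of one dot product along K: weights stay in their 1-byte e5m2 encoding and are widened only inside the inner loop. Everything below (function names, fp32 stand-ins for the fp16 activations and the accumulator) is illustrative, not code from the repo.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one fp8 e5m2 value (bit layout: s eeeee mm, exponent bias 15).
float decode_e5m2(uint8_t b) {
    int sign = (b >> 7) & 1;
    int exp  = (b >> 2) & 0x1F;
    int man  = b & 0x3;
    float v;
    if (exp == 0)       v = std::ldexp(man / 4.0f, -14);            // subnormal
    else if (exp == 31) v = man ? NAN : INFINITY;                   // inf / nan
    else                v = std::ldexp(1.0f + man / 4.0f, exp - 15);
    return sign ? -v : v;
}

// One K-loop of the mixed-precision matmul: activations arrive in higher
// precision (float here, fp16 on the GPU), while weights stay as e5m2
// bytes and are upcast "late", just before the multiply-accumulate.
float dot_fp16_e5m2(const float* act, const uint8_t* w, int k) {
    float acc = 0.0f;  // wide accumulator, as matrix pipelines typically use
    for (int i = 0; i < k; ++i)
        acc += act[i] * decode_e5m2(w[i]);
    return acc;
}
```

Because the weight operand is 1 byte instead of 2, a K-tile of weights moves half the data from VRAM, which is where the bandwidth win described in the analysis comes from.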

// ANALYSIS

This is a smart bandwidth-and-occupancy play, not a magic precision trick. The interesting part is that AMD users can squeeze more out of existing matrix hardware by changing where the data lives and how long it stays there.

  • The kernel keeps fp8 weights in LDS and upcasts late, which trims VRAM-to-LDS traffic and reduces instruction pressure in the K-loop.
  • The repo reports 52 TFLOPS from C++ and 43 TFLOPS from Python on Strix Halo, versus 30 TFLOPS for torch fp16, so the path can beat the default ROCm kernels when launch overhead is controlled.
  • The author is explicit that bigger diffusion-sized matrices dilute the win, so this looks more compelling for smaller linear layers and batch-1 decode-like workloads.
  • It is handwritten HIP/asm rather than Triton or Tensile, which leaves headroom for deeper tuning but also means more sensitivity to ROCm and driver churn.
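The "upcasts late" step is especially cheap for e5m2 because the format shares fp16's sign and 5-bit-exponent layout: an e5m2 byte is exactly the high byte of the corresponding fp16 value, so widening a weight fetched from LDS is a single shift. A minimal sketch (function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// fp8 e5m2 is s eeeee mm — the top 8 bits of fp16 (s eeeee mmmmmmmmmm).
// Upcasting is therefore one left shift; sign, exponent, infinities, NaNs,
// and subnormals all map over bit-exactly.
inline uint16_t e5m2_to_fp16_bits(uint8_t w) {
    return static_cast<uint16_t>(w) << 8;
}
```

On the GPU this would presumably be done several values at a time with packed shifts rather than scalar ops, but the cost per element stays at roughly one ALU instruction, which is why the K-loop instruction pressure stays low.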
// TAGS
gpu · inference · llm · open-source · comfyui-featherops

DISCOVERED

2026-03-22

PUBLISHED

2026-03-22

RELEVANCE

8/10

AUTHOR

woct0rdho