OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
llama.cpp adds flash-attn for Mistral Small 4
llama.cpp is adding CUDA flash-attn coverage for DKQ=320/DV=256 with ncols2=32, which keeps Mistral Small 4 off the CPU fallback path. The PR targets the model’s GQA=32 shape and shows a large throughput jump on an RTX PRO 6000 Blackwell system.
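To make the shape gating concrete, here is a minimal, hypothetical C++ sketch of how a dispatcher might check whether a fused kernel instance exists for a given (DKQ, DV, ncols2) combination before falling back. The struct, function name, and supported-shape list are illustrative assumptions, not llama.cpp's actual internals.

```cpp
// Hypothetical sketch (not llama.cpp's real dispatcher): gate the fused
// CUDA flash-attn path on head dimensions and the GQA-derived column
// grouping, and fall back when no compiled instance covers the shape.
#include <cstdio>

struct AttnShape {
    int dkq;    // head dim of K/Q
    int dv;     // head dim of V
    int ncols2; // query columns grouped per KV head (tied to the GQA ratio)
};

// Pretend registry of compiled kernel instances; before the PR, the
// 320/256 shape with ncols2=32 would simply not be in this list.
static bool has_flash_attn_instance(const AttnShape & s) {
    const AttnShape supported[] = {
        {128, 128,  8},
        {256, 256, 16},
        {320, 256, 32}, // the shape this PR adds
    };
    for (const auto & k : supported) {
        if (k.dkq == s.dkq && k.dv == s.dv && k.ncols2 == s.ncols2) {
            return true;
        }
    }
    return false;
}

int main() {
    const AttnShape mistral_small4 = {320, 256, 32}; // GQA=32 -> ncols2=32
    if (has_flash_attn_instance(mistral_small4)) {
        std::puts("dispatch: fused CUDA flash-attn kernel");
    } else {
        std::puts("dispatch: fallback path (much slower)");
    }
    return 0;
}
```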
// ANALYSIS
This is the kind of low-level kernel work that quietly changes real-world model performance: one missing attention shape can erase most of the benefit of GPU offload.
- The patch adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and a dedicated tile .cu path for the 320/256 shape.
- The key constraint is ncols2=32, which matches the model's GQA ratio and narrows the fix to the exact path Mistral Small 4 needs.
- Benchmarks in the PR show a dramatic gap versus fallback behavior: prefill jumps from roughly 180 tok/s to about 3.7k tok/s, and decode from about 33 tok/s to about 186 tok/s (see the speedup sketch after this list).
- The upside is obvious for CUDA users running this model; the downside is that the fix is highly shape-specific, so broader coverage will still matter for future variants.
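As a quick sanity check on those benchmark figures (using the approximate numbers quoted above, not the PR's exact tables), the gains work out to roughly 20x for prefill and 5–6x for decode:

```cpp
// Back-of-the-envelope speedup from the approximate figures in the post.
#include <cstdio>

int main() {
    const double prefill_before = 180.0,  prefill_after = 3700.0; // tok/s
    const double decode_before  = 33.0,   decode_after  = 186.0;  // tok/s
    std::printf("prefill speedup: ~%.1fx\n", prefill_after / prefill_before); // ~20.6x
    std::printf("decode  speedup: ~%.1fx\n", decode_after  / decode_before);  // ~5.6x
    return 0;
}
```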
// TAGS
llm · gpu · inference · open-source · llama-cpp
DISCOVERED
4h ago
2026-04-29
PUBLISHED
6h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
jacek2023