OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
llama.cpp adds flash-attn for Mistral Small 4
llama.cpp is adding CUDA flash-attn coverage for DKQ=320/DV=256 with ncols2=32, which keeps Mistral Small 4 off the CPU fallback path. The PR targets the model’s GQA=32 shape and shows a large throughput jump on an RTX PRO 6000 Blackwell system.
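To make the shape gating concrete, here is a minimal, hypothetical C++ sketch of how a dispatcher might check whether a fused kernel instance exists for a given (DKQ, DV, ncols2) combination before falling back. The struct, function name, and supported-shape list are illustrative assumptions, not llama.cpp's actual internals.

```cpp
// Hypothetical sketch (not llama.cpp's real dispatcher): gate the fused
// CUDA flash-attn path on head dimensions and the GQA-derived column
// grouping, and fall back when no compiled instance covers the shape.
#include <cstdio>

struct AttnShape {
    int dkq;    // head dim of K/Q
    int dv;     // head dim of V
    int ncols2; // query columns grouped per KV head (tied to the GQA ratio)
};

// Pretend registry of compiled kernel instances; before the PR, the
// 320/256 shape with ncols2=32 would simply not be in this list.
static bool has_flash_attn_instance(const AttnShape & s) {
    const AttnShape supported[] = {
        {128, 128,  8},
        {256, 256, 16},
        {320, 256, 32}, // the shape this PR adds
    };
    for (const auto & k : supported) {
        if (k.dkq == s.dkq && k.dv == s.dv && k.ncols2 == s.ncols2) {
            return true;
        }
    }
    return false;
}

int main() {
    const AttnShape mistral_small4 = {320, 256, 32}; // GQA=32 -> ncols2=32
    if (has_flash_attn_instance(mistral_small4)) {
        std::puts("dispatch: fused CUDA flash-attn kernel");
    } else {
        std::puts("dispatch: fallback path (much slower)");
    }
    return 0;
}
```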
// ANALYSIS
This is the kind of low-level kernel work that quietly changes real-world model performance: one missing attention shape can erase most of the benefit of GPU offload.
- The patch adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and a dedicated tile .cu path for the 320/256 shape.
- The key constraint is ncols2=32, which matches the model's GQA ratio and narrows the fix to the exact path Mistral Small 4 needs.
- Benchmarks in the PR show a dramatic gap versus fallback behavior: prefill jumps from roughly 180 tok/s to about 3.7k tok/s, and decode from about 33 tok/s to about 186 tok/s (see the speedup sketch after this list).
- The upside is obvious for CUDA users running this model; the downside is that the fix is highly shape-specific, so broader coverage will still matter for future variants.
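As a quick sanity check on those benchmark figures (using the approximate numbers quoted above, not the PR's exact tables), the gains work out to roughly 20x for prefill and 5–6x for decode:

```cpp
// Back-of-the-envelope speedup from the approximate figures in the post.
#include <cstdio>

int main() {
    const double prefill_before = 180.0,  prefill_after = 3700.0; // tok/s
    const double decode_before  = 33.0,   decode_after  = 186.0;  // tok/s
    std::printf("prefill speedup: ~%.1fx\n", prefill_after / prefill_before); // ~20.6x
    std::printf("decode  speedup: ~%.1fx\n", decode_after  / decode_before);  // ~5.6x
    return 0;
}
```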
// TAGS
llm · gpu · inference · open-source · llama-cpp
DISCOVERED
4h ago
2026-04-29
PUBLISHED
6h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
jacek2023