OPEN_SOURCE ↗
X · 5h ago
BENCHMARK RESULT
PyTorch solve_ex hits 10x cliff at 512
John Carmack flagged a sharp GPU performance discontinuity in batched `torch.linalg.solve_ex()`, where 512x512 matrices ran more than 10x slower than 511x511. The post is a reminder that linear-algebra backends can change behavior abruptly at exact shape thresholds.
// ANALYSIS
My read is that this is almost certainly a backend-selection or kernel-tuning cliff, not a mathematical property of the solve itself. GPU library performance is piecewise, not smooth, so a one-element change in shape can matter more than a larger algorithmic change.
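One quick way to probe that hypothesis, assuming a recent PyTorch build that exposes the backend knob, is to force a specific linear-algebra backend and re-run the slow shape; if the cliff moves or disappears, it lives in library dispatch rather than the math:

```python
import torch

# Sketch, not a verified diagnosis: force the backend used for torch.linalg
# ops on CUDA and re-time the suspect shape. Accepted values include
# "default", "cusolver", and "magma" on builds that expose this setting.
torch.backends.cuda.preferred_linalg_library("cusolver")
# ... re-run the 511 vs 512 timing here, then compare against "magma" / "default".
```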
- Exact matrix sizes can flip PyTorch/cuSOLVER/MAGMA onto different kernels, tile sizes, or workspace paths
- Batched linalg is especially sensitive because the “same” operation can hit very different implementation paths as dimensions cross a threshold
- If you rely on GPU solves in training or inference, benchmark neighboring shapes, dtypes, and batch sizes before you lock architecture (see the timing sweep below)
- Padding or chunking away from pathological sizes is often the pragmatic fix when you hit a cliff like this (sketched after this list)
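A minimal timing sweep around the suspected threshold, assuming a CUDA device; the batch size, dtype, and iteration count are illustrative, and the point is to measure 510-514 side by side rather than trust a single shape:

```python
import time
import torch

def time_solve(n, batch=64, dtype=torch.float32, iters=20):
    # Time batched torch.linalg.solve_ex for a (batch, n, n) system on the GPU.
    A = torch.randn(batch, n, n, device="cuda", dtype=dtype)
    # Nudge toward well-conditioned matrices so timings reflect the kernel path,
    # not numerical trouble.
    A += n * torch.eye(n, device="cuda", dtype=dtype)
    b = torch.randn(batch, n, 1, device="cuda", dtype=dtype)

    torch.linalg.solve_ex(A, b)          # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.linalg.solve_ex(A, b)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Sweep the shapes around the suspected cliff before locking an architecture.
for n in (510, 511, 512, 513, 514):
    print(f"n={n}: {time_solve(n) * 1e3:.2f} ms")
```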
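If the sweep confirms a cliff at an exact size, padding the system to a nearby, faster size is one pragmatic escape hatch. This is a sketch under that assumption, not a vetted workaround: `solve_padded` and `pad_to` are hypothetical names, and it embeds the original system in a block-diagonal one so the padded rows solve trivially against the identity.

```python
import torch

def solve_padded(A, b, pad_to=None):
    # Hypothetical workaround: embed an (n, n) system in a slightly larger
    # block-diagonal one to step off a pathological size. Only worth it if the
    # slowdown is shape-triggered rather than data-dependent.
    n = A.shape[-1]
    if pad_to is None or pad_to <= n:
        return torch.linalg.solve_ex(A, b)[0]

    batch = A.shape[:-2]
    # Padded matrix is [[A, 0], [0, I]]; padded rhs is [b; 0].
    A_pad = torch.eye(pad_to, device=A.device, dtype=A.dtype).expand(*batch, pad_to, pad_to).clone()
    A_pad[..., :n, :n] = A
    b_pad = torch.zeros(*batch, pad_to, b.shape[-1], device=A.device, dtype=A.dtype)
    b_pad[..., :n, :] = b

    x_pad, _ = torch.linalg.solve_ex(A_pad, b_pad)
    return x_pad[..., :n, :]
```

Chunking the batch dimension is the analogous move when the threshold turns out to be on batch size rather than matrix size.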
// TAGS
pytorch · gpu · benchmark · open-source · mlops
DISCOVERED
5h ago
2026-04-29
PUBLISHED
7h ago
2026-04-29
RELEVANCE
8/10
AUTHOR
ID_AA_Carmack