PyTorch solve_ex hits 10x 512 cliff
OPEN_SOURCE · BENCHMARK RESULT

John Carmack flagged a sharp GPU performance discontinuity in batched `torch.linalg.solve_ex()`, where 512x512 matrices ran more than 10x slower than 511x511. The post is a reminder that linear-algebra backends can change behavior abruptly at exact shape thresholds.

// ANALYSIS

My read is this is almost certainly a backend-selection or kernel-tuning cliff, not a mathematical property of the solve itself. GPU library performance is piecewise, not smooth, so a one-element shape change can matter more than a larger algorithmic change.

  • Exact matrix sizes can flip PyTorch/cuSOLVER/MAGMA onto different kernels, tile sizes, or workspace paths
  • Batched linalg is especially sensitive because the “same” operation can hit very different implementation paths as dimensions cross a threshold
  • If you rely on GPU solves in training or inference, benchmark neighboring shapes, dtypes, and batch sizes before you lock architecture
  • Padding or chunking away from pathological sizes is often the pragmatic fix when you hit a cliff like this
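The two pragmatic tactics above can be sketched concretely. The snippet below is a minimal illustration, not a reproduction of the original benchmark: it uses NumPy's batched `np.linalg.solve` on CPU so it runs anywhere (on GPU you would substitute `torch.linalg.solve_ex` and bracket the timed region with `torch.cuda.synchronize()`), and the scan range and `pad_to_size` helper are hypothetical names chosen here for illustration. Padding works because solving the block-diagonal system `[[A, 0], [0, I]] [x; y] = [b; 0]` leaves the first `n` rows of the solution identical to the original `A x = b`.

```python
import time
import numpy as np

def time_batched_solve(n, batch=32, repeats=5):
    """Best-of-`repeats` wall time for a batch of n x n solves."""
    rng = np.random.default_rng(0)
    # Diagonally dominant matrices keep the solves well-conditioned.
    a = rng.standard_normal((batch, n, n)) + n * np.eye(n)
    b = rng.standard_normal((batch, n, 1))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.linalg.solve(a, b)
        best = min(best, time.perf_counter() - t0)
    return best

def pad_to_size(a, b, n_target):
    """Embed each n x n system in a block-diagonal n_target x n_target one.

    The padded solve returns the original solution in its first n rows,
    so padding e.g. 512 -> 513 can sidestep a shape-specific slow path.
    """
    batch, n, _ = a.shape
    a_pad = np.zeros((batch, n_target, n_target), dtype=a.dtype)
    a_pad[:, :n, :n] = a
    idx = np.arange(n, n_target)
    a_pad[:, idx, idx] = 1.0  # identity block keeps the matrix invertible
    b_pad = np.zeros((batch, n_target, b.shape[2]), dtype=b.dtype)
    b_pad[:, :n, :] = b
    return a_pad, b_pad

# Scan the neighborhood of a suspected cliff (small sizes here for CPU;
# around a real cliff you would scan e.g. range(510, 515)).
timings = {n: time_batched_solve(n) for n in range(62, 67)}
for n, t in sorted(timings.items()):
    print(f"{n:4d}: {t * 1e3:8.3f} ms")
```

A one-element jump in the scan output is exactly the kind of discontinuity the post describes; if one appears, `pad_to_size` (or chunking the batch) is the cheap workaround while the upstream kernel-selection issue gets fixed.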
// TAGS
pytorch · gpu · benchmark · open-source · mlops

DISCOVERED

5h ago

2026-04-29

PUBLISHED

7h ago

2026-04-29

RELEVANCE

8 / 10

AUTHOR

ID_AA_Carmack