llama.cpp Windows slowdown stumps AMD users
A Reddit user reports that llama.cpp on Windows 11 with an AMD 7900 XTX starts fast, then throughput suddenly drops from about 39 tok/s to 15 tok/s, and only a full reboot restores it. The same setup runs normally on Linux, pointing to a Windows-specific runtime, driver, or power-management problem rather than model settings.
This looks less like a llama.cpp bug than a bad interaction between Windows, AMD GPU drivers, and backend state in the runtime. The fact that restarting llama.cpp and the graphics driver does not fix it suggests the slowdown happens below the application layer, likely in driver state, memory placement, or power-management behavior. The Windows-versus-Linux split is the strongest signal here: same hardware, same models, different outcome. Related llama.cpp discussions have already pointed at Windows ROCm and shared-memory quirks as well as backend-specific regressions, so this fits a broader pattern of brittle AMD-on-Windows inference behavior. That multiple models and context sizes hit the same ceiling makes a model-specific optimization bug unlikely, and it is another reminder that stable throughput matters as much as peak tok/s.
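One practical way to distinguish a sustained plateau drop like this from ordinary run-to-run noise is to log per-interval tok/s and flag the point where a rolling mean falls well below the initial baseline. The sketch below is purely illustrative (the function name, window size, and threshold are assumptions, not anything from the report); it shows the kind of check that would confirm a ~39 → ~15 tok/s step change rather than gradual drift.

```python
# Hypothetical sketch: given per-interval throughput samples (tok/s),
# flag where a rolling mean first settles well below the initial baseline,
# as in the reported ~39 -> ~15 tok/s drop. Names/thresholds are illustrative.

def find_throughput_drop(samples, window=5, ratio=0.6):
    """Return the index of the first rolling window whose mean falls
    below `ratio` times the baseline (mean of the first window),
    or None if no sustained drop is found."""
    if len(samples) < 2 * window:
        return None  # not enough data for a baseline plus a comparison window
    baseline = sum(samples[:window]) / window
    for i in range(window, len(samples) - window + 1):
        mean = sum(samples[i:i + window]) / window
        if mean < ratio * baseline:
            return i
    return None

# Example: steady near 39 tok/s, then a sustained plateau near 15 tok/s.
samples = [39.2, 38.7, 39.0, 38.9, 39.1, 38.8,
           15.3, 15.1, 14.9, 15.2, 15.0, 15.1]
drop = find_throughput_drop(samples)  # index of first degraded window
```

A check like this, fed by timestamped tok/s logs from the inference loop, would also show whether the drop correlates with events such as VRAM pressure or a driver power-state change.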
DISCOVERED
2h ago
2026-05-11
PUBLISHED
5h ago
2026-05-11
AUTHOR
soyalemujica