llama.cpp Windows slowdown stumps AMD users
A Reddit user reports that llama.cpp on Windows 11 with an AMD 7900 XTX starts fast, then throughput suddenly drops from about 39 tok/s to 15 tok/s, and only a full reboot restores it. The same setup runs normally on Linux, pointing to a Windows-specific runtime, driver, or power-management problem rather than model settings.
This looks less like a llama.cpp bug than a bad interaction between Windows, AMD GPU drivers, and backend state in the runtime. The fact that restarting llama.cpp and the graphics driver does not fix it suggests the slowdown happens below the application layer, likely in driver state, memory placement, or power-management behavior. The Windows-versus-Linux split is the strongest signal here: same hardware, same models, different outcome. Related llama.cpp discussions have already pointed at Windows ROCm and shared-memory quirks as well as backend-specific regressions, so this fits a broader pattern of brittle AMD-on-Windows inference behavior. That multiple models and context sizes hit the same ceiling makes a model-specific optimization bug unlikely, and it is another reminder that stable throughput matters as much as peak tok/s.
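One practical way to distinguish a sustained plateau drop like this from ordinary run-to-run noise is to log per-interval tok/s and flag the point where a rolling mean falls well below the initial baseline. The sketch below is purely illustrative (the function name, window size, and threshold are assumptions, not anything from the report); it shows the kind of check that would confirm a ~39 → ~15 tok/s step change rather than gradual drift.

```python
# Hypothetical sketch: given per-interval throughput samples (tok/s),
# flag where a rolling mean first settles well below the initial baseline,
# as in the reported ~39 -> ~15 tok/s drop. Names/thresholds are illustrative.

def find_throughput_drop(samples, window=5, ratio=0.6):
    """Return the index of the first rolling window whose mean falls
    below `ratio` times the baseline (mean of the first window),
    or None if no sustained drop is found."""
    if len(samples) < 2 * window:
        return None  # not enough data for a baseline plus a comparison window
    baseline = sum(samples[:window]) / window
    for i in range(window, len(samples) - window + 1):
        mean = sum(samples[i:i + window]) / window
        if mean < ratio * baseline:
            return i
    return None

# Example: steady near 39 tok/s, then a sustained plateau near 15 tok/s.
samples = [39.2, 38.7, 39.0, 38.9, 39.1, 38.8,
           15.3, 15.1, 14.9, 15.2, 15.0, 15.1]
drop = find_throughput_drop(samples)  # index of first degraded window
```

A check like this, fed by timestamped tok/s logs from the inference loop, would also show whether the drop correlates with events such as VRAM pressure or a driver power-state change.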
DISCOVERED
2h ago
2026-05-11
PUBLISHED
5h ago
2026-05-11
AUTHOR
soyalemujica