llama.cpp prefill performance drops on ROCm
A user on Reddit reported a ~9% drop in prompt processing (prefill) performance for llama.cpp when running on ROCm with an AMD 7900 XTX, despite a 15% increase in token generation speed. The benchmarks, run on openSUSE Tumbleweed, suggest that recent changes have shifted the optimal micro-batch (`-ub`, ubatch) size from 256 down to 128, which may be contributing to the inconsistency in real-world ROCm performance.
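A sweep like the one behind these numbers can be reproduced with llama.cpp's bundled `llama-bench` tool; the model path below is a placeholder, and the exact figures will depend on the build and ROCm version:

```shell
# Compare prefill (pp) and token-generation (tg) throughput
# across micro-batch sizes on a HIP/ROCm build of llama.cpp.
# ./model.gguf is a placeholder path, not a specific model.
./llama-bench -m ./model.gguf \
  -p 512 -n 128 \
  -ub 128,256   # sweep the micro-batch sizes discussed in the report
```

Running this against builds before and after the suspect commits would separate a genuine regression from a change in the optimal `-ub` value.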
While the token generation improvement is welcome, the prefill regression matters for users who depend on fast initial response times, particularly in multi-GPU setups. Recent commits appear to favor generation throughput over prefill latency on ROCm, a trade-off that may not suit more complex workflows. The shift in the optimal micro-batch size also points to underlying changes in memory management on RDNA 3 hardware, raising questions about PCIe bandwidth efficiency in dual-GPU configurations.
DISCOVERED
2026-03-24
PUBLISHED
2026-03-24
AUTHOR
ROS_SDN