llama.cpp prefill performance drops on ROCm
OPEN_SOURCE
REDDIT // 19d ago // BENCHMARK RESULT


A Reddit user reported a ~9% drop in prompt-processing (prefill) throughput for llama.cpp running on ROCm with an AMD Radeon RX 7900 XTX, alongside a ~15% increase in token-generation speed. The benchmarks, run on openSUSE Tumbleweed, suggest that recent updates, including a shift in the optimal micro-batch (-ub) size from 256 to 128, may be contributing to inconsistent real-world ROCm performance.
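A comparison like the one described can be sketched with llama.cpp's bundled llama-bench tool, which reports prompt-processing (pp) and token-generation (tg) rates and accepts a list of micro-batch sizes to sweep. The model path below is a placeholder, and the flag values are illustrative, not the poster's exact setup:

```shell
# Sweep micro-batch sizes 128 and 256 on a ROCm build of llama.cpp,
# measuring prefill over a 512-token prompt and generation of 128 tokens.
# ./models/model.gguf is a hypothetical path; substitute your own model.
./llama-bench -m ./models/model.gguf -p 512 -n 128 -ub 128,256
```

Running the same sweep before and after a suspect commit isolates whether the regression tracks the micro-batch size change or the build itself.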

// ANALYSIS

While faster token generation is welcome, the prefill regression hurts users who depend on fast initial response times, especially in multi-GPU setups. Recent ROCm-related commits appear to favor generation throughput over prompt-processing latency, which may frustrate users with long-context workflows. The shift in the optimal micro-batch size also points to underlying changes in memory management on RDNA 3 hardware, raising questions about PCIe bandwidth efficiency in dual-GPU configurations.
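Whether the tradeoff is a net win depends on workload shape. A small sketch, using hypothetical baseline rates (2000 t/s prefill, 100 t/s generation) scaled by the reported ~9% prefill drop and ~15% generation gain, shows that short prompts with long generations come out ahead while long prompts with short generations get slower:

```python
def total_time(prompt_tokens: int, gen_tokens: int,
               pp_rate: float, tg_rate: float) -> float:
    """End-to-end seconds: prefill at pp_rate t/s, then generation at tg_rate t/s."""
    return prompt_tokens / pp_rate + gen_tokens / tg_rate

# Hypothetical baseline rates, then the reported ~9% pp drop / ~15% tg gain.
pp_before, tg_before = 2000.0, 100.0
pp_after, tg_after = pp_before * 0.91, tg_before * 1.15

# Chat-style workload: short prompt, long generation -> net faster.
chat_before = total_time(2048, 256, pp_before, tg_before)   # ≈ 3.58 s
chat_after = total_time(2048, 256, pp_after, tg_after)      # ≈ 3.35 s

# RAG-style workload: long prompt, short generation -> net slower.
rag_before = total_time(8192, 64, pp_before, tg_before)     # ≈ 4.74 s
rag_after = total_time(8192, 64, pp_after, tg_after)        # ≈ 5.06 s
```

This is why a prefill regression can dominate perceived latency for long-context use even when headline tokens-per-second improves.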

// TAGS
llama-cpp, rocm, amd, 7900xtx, benchmarks, llm, local-llm

DISCOVERED

2026-03-24 (19d ago)

PUBLISHED

2026-03-24 (19d ago)

RELEVANCE

8/10

AUTHOR

ROS_SDN