OPEN_SOURCE ↗
REDDIT // 18h ago · INFRASTRUCTURE
AMD Infinity Cache Barely Helps Dense Inference
This Reddit thread asks whether the RX 7900 XTX’s 96MB Infinity Cache meaningfully helps dense LLM inference, especially for models like Qwen 27B. The short answer is that it can help a little in specific access patterns, but raw VRAM bandwidth and capacity still do most of the work.
// ANALYSIS
The cache is real hardware, but it is not a magic multiplier for dense autoregressive inference. For large dense models, the dominant cost is still streaming model weights and KV data through memory every token, so the 3.5 TB/s marketing number should not be treated as sustained model bandwidth.
- AMD’s own specs put the 7900 XTX at 24GB GDDR6, a 384-bit bus, 20 Gbps memory, 960 GB/s peak bandwidth, and 96MB Infinity Cache
- Transformer inference is often bandwidth-bound, with weights loaded once per forward pass and KV cache traffic becoming important at longer contexts or larger batches
- Cache can still help with locality inside kernels and some repeated accesses, so it is not useless, just not a reason to expect a 3.5x jump for dense decoding
- For rough capacity planning, comparing effective VRAM bandwidth across cards is still a reasonable first-order heuristic, but backend efficiency and quantization can move results a lot
- The strongest claim here is not “cache does nothing,” but “cache does not erase memory bandwidth as the bottleneck”
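The bandwidth-bound argument above can be made concrete with a rough roofline-style estimate: if every decoded token streams all model weights (plus KV cache) through VRAM once, peak bandwidth divided by bytes-per-token bounds tokens per second. A minimal sketch, where the model size, quantization, and sustained-efficiency numbers are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope decode-throughput bound for a dense model on an RX 7900 XTX.
# Assumes decoding is memory-bandwidth-bound: each generated token streams all
# weights (plus any KV cache reads) from VRAM once. The model size, quant width,
# and efficiency factor below are illustrative assumptions.

PEAK_VRAM_BW_GBPS = 960  # 7900 XTX GDDR6 peak bandwidth, per AMD's specs

def decode_tokens_per_sec(n_params_billions: float, bytes_per_weight: float,
                          kv_gb_per_token: float = 0.0,
                          efficiency: float = 0.6) -> float:
    """Upper bound on tokens/sec if each token reads all weights + KV once.

    efficiency: assumed fraction of peak bandwidth a real kernel sustains.
    """
    weight_gb = n_params_billions * bytes_per_weight  # GB streamed per token
    gb_per_token = weight_gb + kv_gb_per_token
    return PEAK_VRAM_BW_GBPS * efficiency / gb_per_token

# A hypothetical 27B dense model at 4-bit quantization (~0.5 bytes/weight):
# weights alone are ~13.5 GB per token of streaming -- far larger than the
# 96MB Infinity Cache, which is why the cache cannot hold the working set.
print(f"~{decode_tokens_per_sec(27, 0.5):.0f} tok/s upper bound")
```

Note that the 13.5 GB of weights streamed per token dwarfs the 96MB cache by two orders of magnitude, which is the core of the thread's conclusion: the cache helps kernel-local reuse, not whole-model residency.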
// TAGS
amd-radeon-rx-7900-xtx · llm · inference · gpu · local-first · benchmark
DISCOVERED
18h ago
2026-05-02
PUBLISHED
20h ago
2026-05-02
RELEVANCE
6/10
AUTHOR
boutell