llama.cpp build b8464 hits 10k tokens/sec on R9700
A major update to llama.cpp (build b8464) pushes prompt-processing speed to 10,907 tokens per second on AMD's Radeon AI PRO R9700. The RDNA4-optimized build roughly triples throughput for Qwen 3.5, bringing fast evaluation of full 128k-token contexts within reach of local developers.
The R9700's leap from roughly 4,000 to over 10,000 tokens per second makes advanced techniques like Multi-Token Prediction and speculative decoding dramatically cheaper on consumer hardware. Build b8464 introduces fused Gated Delta Network kernels for Qwen 3.5, reducing graph splits and keeping the entire computation in the GPU's high-bandwidth memory. Flash Attention is the primary driver here, allowing developers to sustain this throughput even at 128k context windows. This performance tier sharply cuts cold-start latency for RAG applications: at these rates, typical retrieval chunks can be pre-processed in well under a second, and even a full 128k-token context in roughly twelve seconds. For AI developers, the R9700 is emerging as a cost-effective alternative to datacenter silicon for iterative model prototyping and high-throughput agent loops.
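A quick sanity check on what these throughput numbers mean for prefill latency. This is a back-of-the-envelope sketch using only the figures reported above (the ~10,907 tok/s build b8464 rate and the ~4,000 tok/s prior baseline); actual latency will also depend on model size, quantization, and batch settings.

```python
# Back-of-the-envelope prefill time from the reported prompt-processing rates.
# Rates taken from the article: ~10,907 tok/s on the R9700 with build b8464,
# versus the ~4,000 tok/s baseline it improved on.

def prefill_seconds(context_tokens: int, tokens_per_sec: float) -> float:
    """Time to pre-process a prompt of `context_tokens` at a given rate."""
    return context_tokens / tokens_per_sec

full_context = 128 * 1024  # a full 128k-token context window

old = prefill_seconds(full_context, 4_000)    # prior baseline
new = prefill_seconds(full_context, 10_907)   # build b8464

print(f"old: {old:.1f}s  new: {new:.1f}s  speedup: {old / new:.1f}x")
# -> old: 32.8s  new: 12.0s  speedup: 2.7x
```

In other words, the update moves a worst-case 128k prefill from about half a minute to about twelve seconds, while small retrieval chunks (a few thousand tokens) drop to sub-second territory.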
DISCOVERED: 2026-03-22
PUBLISHED: 2026-03-22
AUTHOR: greenail