llama.cpp Qwen3.6-27B hits 12 tok/s on 9070 XT
A LocalLLaMA user reports 12 tok/s on an RX 9070 XT running a Q3 Qwen 27B model through `llama.cpp` with a 65K context window. The thread is really about whether that throughput is normal and which knobs still matter when long context is non-negotiable.
12 tok/s does not sound outlandish for this workload. A 27B model at 65K context is a worst-case mix of bandwidth pressure, KV-cache cost, and backend overhead, so the real ceiling is probably lower than short-context demos suggest.
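For concreteness, here is a minimal sketch of what the reported configuration could look like on the command line, assuming a ROCm/HIP or Vulkan build of `llama.cpp`; the model filename, quant suffix, and prompt are placeholders, not details from the thread:

```sh
# Hypothetical reproduction of the reported setup: a Q3-quantized Qwen 27B GGUF,
# 65K context, all layers offloaded to the RX 9070 XT. Paths are placeholders.
#   -c 65536 : the 65K context window under discussion
#   -ngl 99  : offload every layer so generation stays GPU-resident
#   -fa on   : flash attention; older builds expect a bare -fa toggle instead
./llama-cli -m ./qwen-27b-q3_k_m.gguf -c 65536 -ngl 99 -fa on \
            -p "Summarize the following report:" -n 256
```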
- Qwen's docs recommend `llama.cpp >= b5401` for full support, so build vintage matters.
- `-c 65536` is the main tax here; KV-cache and attention costs scale hard with context length even when the KV cache is quantized (a rough size estimate follows this list).
- `-fa on`, `--ubatch-size 128`, and `-b 512` are worth testing (see the tuning sketch after this list), but on AMD the backend and driver path often matter more than CPU thread count.
- `--threads 6` is unlikely to move generation speed much once the model is fully GPU-resident.
- If the context floor stays fixed, the realistic speedups are speculative decoding, more aggressive quantization, or a different backend/kernel stack; the tuning sketch below folds in a draft-model example.
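To make the context tax concrete, a back-of-envelope fp16 KV-cache estimate follows; the layer count, KV-head count, and head dimension are illustrative assumptions, not the actual architecture of the model in the thread:

```sh
# fp16 KV-cache bytes ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes * n_ctx.
# 48 layers, 8 KV heads, and head dim 128 are assumed purely for illustration.
echo "$(( 2 * 48 * 8 * 128 * 2 * 65536 / 1024 / 1024 )) MiB"   # prints "12288 MiB"
```

Quantizing the cache (e.g. to q8_0) roughly halves that footprint, which is part of why the context flag dominates the other knobs at 65K.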
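If those knobs are to be tested together, a hedged tuning pass might look like the sketch below. The filenames are placeholders, the small draft model is hypothetical (speculative decoding needs a draft model with a matching vocabulary), and exact flag spellings can shift between `llama.cpp` builds:

```sh
# Tuning pass over the knobs from the thread, plus the two long-context levers above.
#   -ub 128 / -b 512        : the micro-batch / batch sizes suggested in the thread
#   --cache-type-k/v q8_0   : quantize the KV cache to cut its VRAM footprint
#   -md + --draft-max       : speculative decoding with a small same-vocab draft model
llama-server -m ./qwen-27b-q3_k_m.gguf \
             -c 65536 -ngl 99 -fa on \
             -ub 128 -b 512 -t 6 \
             --cache-type-k q8_0 --cache-type-v q8_0 \
             -md ./qwen-small-draft-q8_0.gguf --draft-max 16 \
             --port 8080
```

In most builds, quantizing the V cache requires flash attention, so `-fa on` stays in the line even while the other flags are being swept.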
DISCOVERED: 2026-05-09
PUBLISHED: 2026-05-09
AUTHOR: Ok-Internal9317