llama.cpp Qwen3.6-27B hits 12 tok/s on 9070 XT

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more.

// WHAT AICRIER DOES

7+ tracked feeds, scraped 24/7. Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

// 2h ago · BENCHMARK RESULT

A LocalLLaMA user reports 12 tok/s on an RX 9070 XT running a Q3 Qwen 27B model through `llama.cpp` with a 65K context window. The thread is really about whether that throughput is normal and which knobs still matter when long context is non-negotiable.
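
For orientation, a baseline run in the spirit of the reported setup might look like the sketch below; the model filename, prompt file, full offload via `-ngl 99`, and the 256-token generation length are illustrative assumptions, not details from the post.

```sh
# Minimal sketch of the reported configuration: Q3 GGUF, 65K context,
# all layers offloaded to the GPU, 6 CPU threads, 256 generated tokens
# to measure decode speed. Filename, prompt, and -ngl are placeholders.
./llama-cli \
  -m Qwen-27B-Q3_K_M.gguf \
  -f long-prompt.txt \
  -c 65536 \
  -ngl 99 \
  -t 6 \
  -n 256
```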

// ANALYSIS

12 tok/s does not sound outlandish for this workload. A 27B model at 65K context is a worst-case mix of bandwidth pressure, KV-cache cost, and backend overhead, so the real ceiling is probably lower than short-context demos suggest.
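
To put a rough number on that KV-cache pressure, here is a back-of-the-envelope estimate; the 48 layers, 8 KV heads, and head dimension of 128 are illustrative stand-ins, since the post does not state the model's architecture.

```sh
# Approximate f16 KV-cache size at a 65,536-token context for an assumed
# 27B-class layout (48 layers, 8 KV heads, head dim 128 -- illustration only):
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes/elem
echo "$(( 2 * 48 * 8 * 128 * 65536 * 2 / 1024 / 1024 )) MiB"   # prints 12288 MiB
# Quantizing the KV cache to q8_0 roughly halves that, but each generated token
# still attends over all 65K cached positions, so decode speed drops regardless.
```

Under these made-up dimensions the f16 cache alone would sit around 12 GiB, which is why KV quantization and offload choices dominate once a 65K context meets a 16 GB card like the 9070 XT.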

  • Qwen's docs recommend `llama.cpp >= b5401` for full support, so build vintage matters.
  • `-c 65536` is the main tax here; KV cache and attention cost scale hard with context length even when KV is quantized.
  • `-fa on`, `--ubatch-size 128`, and `-b 512` are worth testing (see the example command after this list), but on AMD the backend and driver path often matter more than CPU thread count.
  • `--threads 6` is unlikely to move generation speed much once the model is fully GPU-resident.
  • If the context floor stays fixed, the realistic speedups are speculative decoding, more aggressive quantization, or a different backend/kernel stack.
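
As a concrete starting point for those knobs, a tuned invocation might look like the sketch below. The filename is a placeholder, `-fa on`, `--ubatch-size 128`, `-b 512`, and `-t 6` echo the values discussed above, and the q8_0 KV-cache types are an extra assumption layered on top of what the post describes.

```sh
# Confirm the build first, since Qwen's docs tie full support to build vintage.
./llama-cli --version

# Hypothetical tuned run combining the knobs discussed above. Filename and the
# q8_0 KV-cache quantization are assumptions; the other flags mirror the thread.
./llama-server \
  -m Qwen-27B-Q3_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  -fa on \
  --ubatch-size 128 \
  -b 512 \
  -t 6 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

In llama.cpp a quantized V cache requires flash attention, so `-fa on` and the q8_0 cache types belong together in this sketch.
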
// TAGS
llama-cpp · qwen3-6-27b · llm · inference · gpu · quantization · long-context · open-source

DISCOVERED: 2h ago (2026-05-09)
PUBLISHED: 5h ago (2026-05-09)
RELEVANCE: 8/10
AUTHOR: Ok-Internal9317