llama.cpp Qwen3.6-27B hits 12 tok/s on 9070 XT
A LocalLLaMA user reports 12 tok/s on an RX 9070 XT running a Q3 Qwen 27B model through `llama.cpp` with a 65K context window. The thread is really about whether that throughput is normal and which knobs still matter when long context is non-negotiable.
12 tok/s does not sound outlandish for this workload. A 27B model at 65K context is a worst-case mix of bandwidth pressure, KV-cache cost, and backend overhead, so the real ceiling is probably lower than short-context demos suggest.
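For concreteness, here is a minimal sketch of what the reported configuration could look like on the command line, assuming a ROCm/HIP or Vulkan build of `llama.cpp`; the model filename, quant suffix, and prompt are placeholders, not details from the thread:

```sh
# Hypothetical reproduction of the reported setup: a Q3-quantized Qwen 27B GGUF,
# 65K context, all layers offloaded to the RX 9070 XT. Paths are placeholders.
#   -c 65536 : the 65K context window under discussion
#   -ngl 99  : offload every layer so generation stays GPU-resident
#   -fa on   : flash attention; older builds expect a bare -fa toggle instead
./llama-cli -m ./qwen-27b-q3_k_m.gguf -c 65536 -ngl 99 -fa on \
            -p "Summarize the following report:" -n 256
```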
- Qwen's docs recommend `llama.cpp >= b5401` for full support, so build vintage matters.
- `-c 65536` is the main tax here; KV-cache and attention costs scale hard with context length even when the KV cache is quantized (a rough size estimate follows this list).
- `-fa on`, `--ubatch-size 128`, and `-b 512` are worth testing (see the tuning sketch after this list), but on AMD the backend and driver path often matter more than CPU thread count.
- `--threads 6` is unlikely to move generation speed much once the model is fully GPU-resident.
- If the context floor stays fixed, the realistic speedups are speculative decoding, more aggressive quantization, or a different backend/kernel stack; the tuning sketch below folds in a draft-model example.
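To make the context tax concrete, a back-of-envelope fp16 KV-cache estimate follows; the layer count, KV-head count, and head dimension are illustrative assumptions, not the actual architecture of the model in the thread:

```sh
# fp16 KV-cache bytes ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes * n_ctx.
# 48 layers, 8 KV heads, and head dim 128 are assumed purely for illustration.
echo "$(( 2 * 48 * 8 * 128 * 2 * 65536 / 1024 / 1024 )) MiB"   # prints "12288 MiB"
```

Quantizing the cache (e.g. to q8_0) roughly halves that footprint, which is part of why the context flag dominates the other knobs at 65K.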
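If those knobs are to be tested together, a hedged tuning pass might look like the sketch below. The filenames are placeholders, the small draft model is hypothetical (speculative decoding needs a draft model with a matching vocabulary), and exact flag spellings can shift between `llama.cpp` builds:

```sh
# Tuning pass over the knobs from the thread, plus the two long-context levers above.
#   -ub 128 / -b 512        : the micro-batch / batch sizes suggested in the thread
#   --cache-type-k/v q8_0   : quantize the KV cache to cut its VRAM footprint
#   -md + --draft-max       : speculative decoding with a small same-vocab draft model
llama-server -m ./qwen-27b-q3_k_m.gguf \
             -c 65536 -ngl 99 -fa on \
             -ub 128 -b 512 -t 6 \
             --cache-type-k q8_0 --cache-type-v q8_0 \
             -md ./qwen-small-draft-q8_0.gguf --draft-max 16 \
             --port 8080
```

In most builds, quantizing the V cache requires flash attention, so `-fa on` stays in the line even while the other flags are being swept.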
DISCOVERED: 2026-05-09
PUBLISHED: 2026-05-09
AUTHOR: Ok-Internal9317