OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen 3.5 27B hits 2,000 TPS
A LocalLLaMA user reports roughly 2,000 tokens/sec prefill throughput on a markdown-document classification workload, using an Unsloth Q5_K_XL GGUF build of Qwen 3.5 27B on an RTX 5090 running llama.cpp with CUDA 13. The setup is tuned for long inputs, minimal outputs, and batch parallelism, making it strong for high-volume classification but highly workload-specific.
// ANALYSIS
This is a strong real-world throughput datapoint for local inference, but it should be read as a specialized benchmark rather than a general performance baseline.
- The reported speed is dominated by input-heavy prefill, not long-form generation throughput.
- Disabling vision (mmproj) and using "no thinking" removed extra compute paths for this text-only task.
- Reducing context to 128k and matching parallelism to batch size (8) helped keep VRAM pressure controlled.
- The author notes evals are still partial, so accuracy and quality tradeoffs need fuller validation.
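The tuning described above can be sketched as a llama-server launch. This is a hypothetical reconstruction, not the author's actual command: the model path, port, and exact flag values are assumptions, though the flags themselves (`-c`, `--parallel`, `-ngl`, `--flash-attn`) are real llama.cpp server options.

```shell
# Hypothetical llama.cpp setup approximating the reported configuration.
# -ngl 99      : offload all layers to the GPU (RTX 5090)
# -c 131072    : cap context at 128k tokens instead of the model maximum
# --parallel 8 : 8 server slots, matched to the request batch size
# Vision stays disabled simply by not passing an --mmproj file.
llama-server \
  -m ./Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 \
  -c 131072 \
  --parallel 8 \
  --flash-attn \
  --port 8080
```

The "no thinking" part is applied on the request/chat-template side (e.g. disabling the model's thinking mode per request) rather than at server launch, so it is omitted here.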
// TAGS
qwen3-5-27b · llm · inference · gpu · benchmark · llama-cpp
DISCOVERED
2026-03-14
PUBLISHED
2026-03-13
RELEVANCE
8/10
AUTHOR
awitod