OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
Unsloth Qwen3.6-35B-A3B GGUF hits 44 t/s
A LocalLLaMA user reports Qwen3.6-35B-A3B GGUF Q8_0 running at 44 tokens per second on an RTX 5070 Ti 16GB with 32GB DDR5 RAM. The setup uses a 36.9GB quant, LM Studio offload tuning, and 128K context, making this a practical local-inference datapoint rather than a formal benchmark.
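The numbers above already imply a hybrid setup: a 36.9GB quant cannot be fully resident on a 16GB card. A minimal arithmetic sketch, assuming the figures from the post and an illustrative (not measured) allowance for KV cache and runtime overhead:

```python
# Rough memory-split arithmetic for the reported setup. The quant size and
# VRAM figure come from the post; the overhead allowance is an illustrative
# round number, not a measurement.
quant_gb = 36.9           # reported Q8_0 GGUF file size
vram_gb = 16.0            # RTX 5070 Ti
kv_and_overhead_gb = 3.0  # assumed allowance for KV cache, buffers, display

vram_budget = vram_gb - kv_and_overhead_gb
cpu_resident = quant_gb - vram_budget
print(f"GPU-resident weights: ~{vram_budget:.1f} GB")
print(f"CPU/RAM-resident weights: ~{cpu_resident:.1f} GB")
```

Under these assumptions roughly two thirds of the weights live in system RAM, which is why the offload tuning, not the GPU alone, determines the tokens-per-second figure.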
// ANALYSIS
This is the kind of result that matters more than synthetic leaderboard noise: it shows a large MoE model can be made usable on prosumer hardware with careful offload and cache settings.
- The headline number is impressive, but it depends on a hybrid GPU+CPU setup rather than pure VRAM residency.
- A Q8_0 KV cache and offloading 26 MoE expert layers to CPU are doing much of the heavy lifting here.
- The post itself hints that llama.cpp may outperform LM Studio for this workload, so the real story is deployment efficiency, not one fixed speed figure.
- The 128K-context claim is the key practical signal: this is about keeping long-context local workflows viable on midrange hardware.
- For local AI builders, this reinforces that model choice and runtime tuning can matter as much as raw GPU tier.
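The tuning described above can be sketched as a llama.cpp launch command. This is a hypothetical reconstruction, not the poster's configuration: the model filename is a placeholder, and `--n-cpu-moe 26` assumes the post's "26 MoE experts" means keeping the expert tensors of 26 layers in system RAM, which is what llama.cpp's `--n-cpu-moe` flag does.

```shell
# Hypothetical llama.cpp launch approximating the reported setup.
# Model filename is a placeholder for the 36.9 GB Q8_0 quant.
#   --n-gpu-layers 99      : offload all repeating layers to the 16 GB GPU
#   --n-cpu-moe 26         : keep MoE expert tensors of 26 layers in system RAM
#   --ctx-size 131072      : the post's 128K context window
#   --cache-type-k/v q8_0  : quantize the KV cache to Q8_0 to fit long context
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 26 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

The same split can be expressed in LM Studio's GPU-offload sliders; the flag-level view just makes explicit which knobs (expert placement, KV-cache quantization, context length) trade against each other.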
// TAGS
qwen3.6-35b-a3b · unsloth · llm · inference · gpu · benchmark · self-hosted
DISCOVERED
4h ago
2026-04-24
PUBLISHED
5h ago
2026-04-24
RELEVANCE
8/10
AUTHOR
moahmo88