Unsloth Qwen3.6-35B-A3B GGUF hits 44 t/s
OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT

A LocalLLaMA user reports Qwen3.6-35B-A3B GGUF Q8_0 running at 44 tokens per second on an RTX 5070 Ti 16GB with 32GB DDR5 RAM. The setup uses a 36.9GB quant, LM Studio offload tuning, and 128K context, making this a practical local-inference datapoint rather than a formal benchmark.
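
The hybrid setup follows directly from the numbers in the post: a 36.9GB quant cannot fit in 16GB of VRAM, so a chunk of the weights must sit in system RAM. A quick back-of-envelope check (figures from the post; the exact split the user chose is not stated):

```shell
# The Q8_0 quant file is larger than the card's VRAM, so some weights
# must be offloaded to system RAM regardless of runtime settings.
quant_mb=37786   # ~36.9 GB quant file, as reported
vram_mb=16384    # RTX 5070 Ti, 16 GB
overflow_mb=$(( quant_mb - vram_mb ))
echo "at least ${overflow_mb} MB of weights must live in system RAM"
```

This ignores KV-cache and activation memory, which make the real GPU budget tighter still.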

// ANALYSIS

This is the kind of result that matters more than synthetic leaderboard noise: it shows a large MoE model can be made usable on prosumer hardware with careful offload and cache settings.

  • The headline number is impressive, but it depends on a hybrid GPU+CPU setup rather than pure VRAM residency.
  • Q8_0 KV-cache quantization and offloading 26 MoE expert layers to CPU are doing much of the heavy lifting here.
  • The post itself hints that llama.cpp may outperform LM Studio for this workload, so the real story is deployment efficiency, not one fixed speed figure.
  • The 128K context claim is the key practical signal: this is about keeping long-context local workflows alive on midrange hardware.
  • For local AI builders, this reinforces that model choice and runtime tuning can matter as much as raw GPU tier.
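
The offload recipe described above can be sketched as a llama.cpp invocation. This is a hedged sketch, not the poster's actual command: the post used LM Studio, the model filename is a placeholder, and flag names are taken from recent llama.cpp builds (check `llama-server --help` on your version):

```shell
# 128K context, all layers on GPU except 26 MoE expert layers pushed to
# system RAM; KV cache quantized to Q8_0 (flash attention is required
# for a quantized V cache in llama.cpp).
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 131072 \
  -ngl 99 \
  --n-cpu-moe 26 \
  -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Keeping the dense attention layers on GPU while spilling only the sparsely-activated expert tensors to CPU is what makes MoE models unusually tolerant of this kind of split.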
// TAGS
qwen3.6-35b-a3b · unsloth · llm · inference · gpu · benchmark · self-hosted

DISCOVERED

4h ago

2026-04-24

PUBLISHED

5h ago

2026-04-24

RELEVANCE

8 / 10

AUTHOR

moahmo88