OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
Unsloth Qwen3.6-35B-A3B GGUF hits 44 t/s
A LocalLLaMA user reports Qwen3.6-35B-A3B GGUF Q8_0 running at 44 tokens per second on an RTX 5070 Ti 16GB with 32GB DDR5 RAM. The setup uses a 36.9GB quant, LM Studio offload tuning, and 128K context, making this a practical local-inference datapoint rather than a formal benchmark.
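The numbers above already imply a hybrid setup: a 36.9GB quant cannot be fully resident on a 16GB card. A minimal arithmetic sketch, assuming the figures from the post and an illustrative (not measured) allowance for KV cache and runtime overhead:

```python
# Rough memory-split arithmetic for the reported setup. The quant size and
# VRAM figure come from the post; the overhead allowance is an illustrative
# round number, not a measurement.
quant_gb = 36.9           # reported Q8_0 GGUF file size
vram_gb = 16.0            # RTX 5070 Ti
kv_and_overhead_gb = 3.0  # assumed allowance for KV cache, buffers, display

vram_budget = vram_gb - kv_and_overhead_gb
cpu_resident = quant_gb - vram_budget
print(f"GPU-resident weights: ~{vram_budget:.1f} GB")
print(f"CPU/RAM-resident weights: ~{cpu_resident:.1f} GB")
```

Under these assumptions roughly two thirds of the weights live in system RAM, which is why the offload tuning, not the GPU alone, determines the tokens-per-second figure.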
// ANALYSIS
This is the kind of result that matters more than synthetic leaderboard noise: it shows a large MoE model can be made usable on prosumer hardware with careful offload and cache settings.
- The headline number is impressive, but it depends on a hybrid GPU+CPU setup rather than pure VRAM residency.
- A Q8_0 KV cache and offloading 26 MoE expert layers to CPU are doing much of the heavy lifting here.
- The post itself hints that llama.cpp may outperform LM Studio for this workload, so the real story is deployment efficiency, not one fixed speed figure.
- The 128K-context claim is the key practical signal: this is about keeping long-context local workflows viable on midrange hardware.
- For local AI builders, this reinforces that model choice and runtime tuning can matter as much as raw GPU tier.
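The tuning described above can be sketched as a llama.cpp launch command. This is a hypothetical reconstruction, not the poster's configuration: the model filename is a placeholder, and `--n-cpu-moe 26` assumes the post's "26 MoE experts" means keeping the expert tensors of 26 layers in system RAM, which is what llama.cpp's `--n-cpu-moe` flag does.

```shell
# Hypothetical llama.cpp launch approximating the reported setup.
# Model filename is a placeholder for the 36.9 GB Q8_0 quant.
#   --n-gpu-layers 99      : offload all repeating layers to the 16 GB GPU
#   --n-cpu-moe 26         : keep MoE expert tensors of 26 layers in system RAM
#   --ctx-size 131072      : the post's 128K context window
#   --cache-type-k/v q8_0  : quantize the KV cache to Q8_0 to fit long context
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 26 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

The same split can be expressed in LM Studio's GPU-offload sliders; the flag-level view just makes explicit which knobs (expert placement, KV-cache quantization, context length) trade against each other.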
// TAGS
qwen3.6-35b-a3b · unsloth · llm · inference · gpu · benchmark · self-hosted
DISCOVERED
4h ago
2026-04-24
PUBLISHED
5h ago
2026-04-24
RELEVANCE
8/10
AUTHOR
moahmo88