Qwen3.5-35B-A3B hits 60 tok/s on 4060 Ti

// 45d ago · BENCHMARK RESULT

A LocalLLaMA user reports tuning llama.cpp to run Unsloth’s Qwen3.5-35B-A3B GGUF at 64k context on an RTX 4060 Ti 16GB, with real-world throughput around 40-60 tok/s. The post focuses on the config shape that made the difference: KV-unified routing, batch sizing, MoE CPU offload, and keeping VRAM pressure under control.
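
To make the "config shape" concrete, here is a minimal sketch that assembles a `llama-server` command line from the kinds of settings the post describes. The flag names are ones llama.cpp exposes in recent builds, but every value and the model path below are placeholders chosen for illustration, not the author's reported settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerConfig:
    """Illustrative llama-server settings; all defaults are placeholder values."""
    model_path: str            # hypothetical path to a GGUF file
    n_ctx: int = 65536         # 64k total context
    n_batch: int = 2048        # logical batch size (placeholder)
    n_ubatch: int = 512        # physical micro-batch size (placeholder)
    n_gpu_layers: int = 99     # offload as many layers as fit (placeholder)
    n_cpu_moe: int = 20        # MoE expert layers kept on CPU (placeholder)
    n_parallel: int = 1        # number of server slots (placeholder)
    kv_unified: bool = True    # share one KV buffer across sequences


def build_argv(cfg: ServerConfig) -> List[str]:
    """Turn a ServerConfig into an argv list suitable for subprocess.run."""
    argv = [
        "llama-server",
        "--model", cfg.model_path,
        "--ctx-size", str(cfg.n_ctx),
        "--batch-size", str(cfg.n_batch),
        "--ubatch-size", str(cfg.n_ubatch),
        "--n-gpu-layers", str(cfg.n_gpu_layers),
        "--n-cpu-moe", str(cfg.n_cpu_moe),
        "--parallel", str(cfg.n_parallel),
    ]
    if cfg.kv_unified:
        argv.append("--kv-unified")
    return argv


if __name__ == "__main__":
    cfg = ServerConfig(model_path="models/example.gguf")  # hypothetical file
    print(" ".join(build_argv(cfg)))
```

Keeping the settings in one structure makes it easier to save and share a per-GPU baseline. Check each flag against `llama-server --help` for the build in use, since option names change between releases.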

// ANALYSIS

This is a useful reality check for local LLM inference: the bottleneck is often runtime configuration, not just GPU size.

  • The win appears to come from the combination of runtime settings rather than a single magic flag, which is why the author emphasizes `n_parallel`, `kv_unified`, `n_ctx_seq`, `n_ctx_slot`, `n_batch`, and `n_ubatch`.
  • The reported numbers come from actual continuation and long-context runs, so they are more credible than a one-off prompt benchmark.
  • A 16GB consumer card handling a 35B-class MoE model at 64k context makes local deployment feel practical for serious dev workflows, not just hobby demos.
  • The post also points to a gap in the ecosystem: people need shared, hardware-specific config baselines instead of rediscovering the same tuning tricks.
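
One way to see why these settings interact is the arithmetic for how much context each sequence gets. The sketch below rests on a simplifying assumption about llama.cpp, not on anything stated in the post: with a unified KV cache every sequence can draw on the full `n_ctx`, and without it the context is split evenly across `n_parallel` slots. The function name is hypothetical, and real builds may round or pad the result, so compare against the values the server prints at startup.

```python
def per_sequence_context(n_ctx: int, n_parallel: int, kv_unified: bool) -> int:
    """Estimate the context available to one sequence.

    Simplified model (an assumption, not taken from the post):
    - unified KV cache: sequences share one buffer of n_ctx tokens
    - split KV cache:   each of n_parallel slots gets an equal share
    """
    if n_ctx < 1 or n_parallel < 1:
        raise ValueError("n_ctx and n_parallel must be positive")
    if kv_unified:
        return n_ctx
    return n_ctx // n_parallel


if __name__ == "__main__":
    n_ctx = 65536  # 64k context, as in the post
    for slots in (1, 2, 4):
        split = per_sequence_context(n_ctx, slots, kv_unified=False)
        shared = per_sequence_context(n_ctx, slots, kv_unified=True)
        print(f"slots={slots}: split={split} unified={shared}")
```

Under this model, a server started with several slots and a split cache would silently offer each request only a fraction of the 64k window, which is one reason a single flag rarely tells the whole story.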
// TAGS
qwen3.5-35b-a3b, llama.cpp, llm, inference, gpu, open-weights, self-hosted, benchmark

DISCOVERED

45d ago

2026-04-16

PUBLISHED

46d ago

2026-04-15

RELEVANCE

8/10

AUTHOR

Nutty_Praline404