OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT
Qwen3.5-35B-A3B hits 60 tok/s on 4060 Ti
A LocalLLaMA user reports tuning llama.cpp to run Unsloth’s Qwen3.5-35B-A3B GGUF at 64k context on an RTX 4060 Ti 16GB, with real-world throughput of roughly 40–60 tok/s. The post focuses on the configuration shape that made the difference: a unified KV cache, batch sizing, offloading MoE expert weights to CPU, and keeping VRAM pressure under control.
// ANALYSIS
This is a useful reality check for local LLM inference: the bottleneck is often runtime configuration, not just GPU size.
- The win appears to come from the effective runtime shape, not a single magic flag, which is why the author emphasizes `n_parallel`, `kv_unified`, `n_ctx_seq`, `n_ctx_slot`, `n_batch`, and `n_ubatch`.
- The reported numbers come from actual continuation and long-context runs, so they are more credible than a one-off prompt benchmark.
- A 16GB consumer card handling a 35B-class MoE model at 64k context makes local deployment feel practical for serious dev workflows, not just hobby demos.
- The post also points to a gap in the ecosystem: people need shared, hardware-specific config baselines instead of rediscovering the same tuning tricks.
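The settings named above map onto llama-server command-line flags. A minimal sketch of what such a launch might look like; the model filename and every numeric value here are assumptions for illustration, not the author's actual config:

```shell
# Hypothetical llama-server launch in the shape the post describes.
# Filename and all values are placeholders, not the poster's exact settings.

# --ctx-size     total context (n_ctx); with --parallel 1 the single slot
#                gets the full 64k (n_ctx_seq / n_ctx_slot follow from this)
# --kv-unified   one unified KV cache instead of per-slot splits
# --batch-size   n_batch, logical prompt batch
# --ubatch-size  n_ubatch, physical micro-batch (the VRAM-sensitive knob)
# --n-cpu-moe    keep MoE expert tensors for the first N layers in CPU RAM,
#                so attention and shared weights fit in 16GB of VRAM
llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --ctx-size 65536 \
  --parallel 1 \
  --kv-unified \
  --batch-size 2048 \
  --ubatch-size 512 \
  --n-gpu-layers 99 \
  --n-cpu-moe 24
```

The usual tuning loop is to raise `--n-cpu-moe` until the model loads without out-of-memory errors, then shrink it (and grow `--ubatch-size`) to reclaim speed.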
// TAGS
qwen3.5-35b-a3b · llama.cpp · llm · inference · gpu · open-weights · self-hosted · benchmark
DISCOVERED
4h ago
2026-04-16
PUBLISHED
1d ago
2026-04-15
RELEVANCE
8/10
AUTHOR
Nutty_Praline404