Qwen3.5-35B-A3B hits 26 t/s at 100K context
OPEN_SOURCE ↗
REDDIT · 25d ago · BENCHMARK RESULT

A LocalLLaMA user benchmark shows Qwen3.5-35B-A3B (Unsloth UD-Q4_K_XL in llama.cpp) sustaining 26.18 t/s generation at a 100,000-token context on an RTX 4060 8GB laptop with 64GB system RAM. The result highlights how aggressive quantization plus CPU offload can make long-context local inference viable on consumer hardware, even if it remains a tradeoff-heavy setup.
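A launch command for this kind of setup might look like the sketch below. Only `-ngl 99` and the Unsloth UD-Q4_K_XL quant are confirmed by the post; the model filename, context size, MoE expert offload, and KV-cache quantization flags are illustrative assumptions for an 8GB-VRAM / 64GB-RAM machine, not the poster's exact command.

```shell
# Hedged sketch of a llama.cpp invocation approximating the benchmarked setup.
# Confirmed from the post: -ngl 99 and the UD-Q4_K_XL quant.
# Assumed: filename, context size, expert offload, KV-cache quantization.
./llama-server \
  -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -c 100000 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Keeping the MoE expert tensors in system RAM via `-ot` is the usual trick that lets `-ngl 99` coexist with 8GB of VRAM: only the small active-parameter path and attention layers stay on the GPU, while the bulky experts are read from system memory.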

// ANALYSIS

This is a strong real-world datapoint for budget local AI: 100K context is no longer exclusive to high-VRAM rigs, but memory bandwidth and offload strategy now matter as much as raw GPU class.

  • Generation speed drops from 34.93 t/s at 5K to 26.18 t/s at 100K, showing predictable long-context degradation but still usable throughput.
  • The setup still leans on system memory (launched with `-ngl 99`, but a 35B model cannot fit entirely in 8GB of VRAM), so reproducing these numbers depends heavily on having large, fast system RAM.
  • Compared with recent Strix Halo community tests, this supports the idea that unified-memory systems can improve headroom, but may not automatically unlock dramatically larger model classes.
  • For buyers deciding between integrated high-memory systems and discrete GPUs (like RX 7900 XTX), this benchmark reinforces that workload profile (context length vs model size vs quant quality) should drive the upgrade path.
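The degradation figures above work out as follows; a quick sanity check using only the two throughput datapoints reported in the benchmark:

```python
# Back-of-envelope check of the reported long-context slowdown.
# Both numbers come straight from the benchmark post.
short_ctx_tps = 34.93  # tokens/sec at 5K context
long_ctx_tps = 26.18   # tokens/sec at 100K context

# Relative throughput drop going from 5K to 100K context.
drop_pct = (1 - long_ctx_tps / short_ctx_tps) * 100
print(f"Throughput drop at 100K vs 5K context: {drop_pct:.1f}%")

# Wall-clock time to generate a 1,000-token reply at each depth.
for label, tps in [("5K", short_ctx_tps), ("100K", long_ctx_tps)]:
    print(f"1K-token reply at {label} context: {1000 / tps:.0f}s")
```

A roughly 25% slowdown across a 20x increase in context depth is what makes this "usable throughput" rather than a cliff: a 1,000-token reply goes from about 29 seconds to about 38 seconds.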
// TAGS
qwen3.5-35b-a3b · llm · benchmark · gpu · inference · self-hosted · open-weights

DISCOVERED

2026-03-17

PUBLISHED

2026-03-17

RELEVANCE

8/10

AUTHOR

External_Dentist1928