Qwen3.5-35B-A3B hits 26 t/s at 100K context
A LocalLLaMA user benchmark shows Qwen3.5-35B-A3B (Unsloth UD-Q4_K_XL in llama.cpp) sustaining 26.18 t/s generation at a 100,000-token context on an RTX 4060 8GB laptop with 64GB system RAM. The result highlights how aggressive quantization plus CPU offload can make long-context local inference viable on consumer hardware, even if it remains a tradeoff-heavy setup.
This is a strong real-world datapoint for budget local AI: 100K context is no longer exclusive to high-VRAM rigs, but memory bandwidth and offload strategy now matter as much as raw GPU class.
- Generation speed drops from 34.93 t/s at 5K to 26.18 t/s at 100K, a loss of roughly 25%: predictable long-context degradation, but still usable throughput.
- The setup relies on partial CPU offload (run with `-ngl 99`, but the model does not fit fully in 8GB of VRAM), so reproducing it depends heavily on having large, fast system RAM.
- Compared with recent Strix Halo community tests, this supports the idea that unified-memory systems can improve headroom but do not automatically unlock dramatically larger model classes.
- For buyers deciding between integrated high-memory systems and discrete GPUs (such as the RX 7900 XTX), this benchmark reinforces that the workload profile (context length vs. model size vs. quant quality) should drive the upgrade path.
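
To put the two reported speeds in perspective, the short sketch below works out the relative slowdown and the wait for a single reply. Only the two t/s figures come from the benchmark; the 500-token reply length and the helper names `relative_drop` and `seconds_for` are illustrative assumptions.

```python
# Figures reported in the benchmark summary above.
SPEED_5K = 34.93    # tokens/s at 5K context
SPEED_100K = 26.18  # tokens/s at 100K context

# Hypothetical reply length, chosen only for illustration (not from the benchmark).
REPLY_TOKENS = 500


def relative_drop(before: float, after: float) -> float:
    """Fractional throughput loss going from `before` to `after`."""
    return (before - after) / before


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second


drop = relative_drop(SPEED_5K, SPEED_100K)
print(f"throughput drop from 5K to 100K: {drop:.1%}")
print(f"ms per token at 5K:   {1000 / SPEED_5K:.1f}")
print(f"ms per token at 100K: {1000 / SPEED_100K:.1f}")
print(f"{REPLY_TOKENS}-token reply at 5K:   {seconds_for(REPLY_TOKENS, SPEED_5K):.1f} s")
print(f"{REPLY_TOKENS}-token reply at 100K: {seconds_for(REPLY_TOKENS, SPEED_100K):.1f} s")
```

This covers generation speed only; the time to process a 100K-token prompt before the first output token is a separate cost that these two figures do not capture.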
DISCOVERED: 2026-03-17 (70d ago)
PUBLISHED: 2026-03-17 (70d ago)
AUTHOR: External_Dentist1928

