Qwen3.5-35B-A3B hits 26 t/s at 100K context

// 117d ago · BENCHMARK RESULT

A LocalLLaMA user benchmark shows Qwen3.5-35B-A3B (Unsloth UD-Q4_K_XL in llama.cpp) sustaining 26.18 t/s generation at a 100,000-token context on an RTX 4060 8GB laptop with 64GB system RAM. The result highlights how aggressive quantization plus CPU offload can make long-context local inference viable on consumer hardware, even if it remains a tradeoff-heavy setup.

// ANALYSIS

This is a strong real-world datapoint for budget local AI: 100K context is no longer exclusive to high-VRAM rigs, but memory bandwidth and offload strategy now matter as much as raw GPU class.

– Generation speed drops from 34.93 t/s at 5K context to 26.18 t/s at 100K, a slowdown of roughly 25%: predictable long-context degradation, but still usable throughput.
– The setup relies on partial CPU offload: `-ngl 99` is set, yet the model does not fit fully in the 8GB of VRAM, so portability depends heavily on having large, fast system RAM.
– Compared with recent Strix Halo community tests, this supports the idea that unified-memory systems can improve headroom but may not automatically unlock dramatically larger model classes.
– For buyers deciding between integrated high-memory systems and discrete GPUs (such as the RX 7900 XTX), this benchmark reinforces that workload profile (context length vs model size vs quant quality) should drive the upgrade path.
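
To put the two generation speeds above in practical terms, here is a minimal Python sketch of the arithmetic. The two throughput figures come from the benchmark; the 500-token reply length and the helper names are hypothetical, chosen only for illustration, and the estimate covers generation only (the summary gives no prompt-processing speed, so that time is not modeled).

```python
# Generation speeds quoted in the benchmark summary (tokens per second).
SPEED_AT_5K = 34.93
SPEED_AT_100K = 26.18

# Hypothetical reply length, not taken from the benchmark.
REPLY_TOKENS = 500


def relative_drop(fast: float, slow: float) -> float:
    """Fractional slowdown when throughput falls from `fast` to `slow`."""
    return (fast - slow) / fast


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a constant rate (generation only)."""
    return n_tokens / tokens_per_second


drop = relative_drop(SPEED_AT_5K, SPEED_AT_100K)
print(f"Slowdown from 5K to 100K context: {drop:.1%}")

for label, speed in (("5K", SPEED_AT_5K), ("100K", SPEED_AT_100K)):
    elapsed = seconds_to_generate(REPLY_TOKENS, speed)
    print(f"{REPLY_TOKENS} tokens at {label} context: {elapsed:.1f} s")
```

Under these assumptions the gap is a few seconds per reply, which is one way to read "still usable"; the time to ingest a 100K-token prompt is a separate cost that this sketch does not estimate.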

// TAGS

qwen3.5-35b-a3b llm benchmark gpu inference self-hosted open-weights

DISCOVERED

117d ago

2026-03-17

PUBLISHED

117d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

External_Dentist1928
