OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE

Qwen3.6-35B-A3B turns to gibberish on RAM spill

A Reddit user reports that Qwen3.6-35B-A3B starts producing gibberish once llama.cpp spills memory from VRAM into system RAM under CUDA Unified Memory. The post asks whether the culprit is the build, the runtime flags, or a deeper unified-memory bug.

// ANALYSIS

This looks more like a llama.cpp memory-management edge case than a prompt or sampling problem. The combination of Unified Memory, aggressive fit settings, and long-context offload is exactly where backend bugs tend to surface.

  • Upstream llama.cpp has known reports that CUDA Unified Memory is flaky for models that spill beyond VRAM, especially under heavy decode/load pressure.
  • The posted config is very aggressive: `--fit`, `--fit-target 256`, `--fit-ctx 204800`, `--kv-offload`, and large batch settings all increase the chance of allocator or paging issues.
  • Qwen3.6 is a sparse MoE model, so total parameter count is not the whole story, but long-context KV cache pressure can still push a run into unstable memory behavior.
  • Recent llama.cpp chatter also suggests some IQ3_S/CUDA combinations have regressions, so changing quantization or CUDA/toolchain version is a plausible next test.
  • For developers serving large local models, this is a reminder that “fits in RAM” is not the same as “runs correctly once pages start migrating.”
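If the unified-memory path is the suspect, a conservative rerun that avoids it entirely is a quick way to isolate the failure. A minimal sketch, assuming the standard llama.cpp CLI; the model filename and the layer count are placeholders, not values from the post:

```shell
# Leave Unified Memory off so oversized allocations fail loudly instead
# of silently paging (llama.cpp only enables UM when this env var is set).
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY

# Pin an explicit layer split and a modest context instead of the post's
# aggressive --fit settings: -ngl fixes how many layers live in VRAM,
# -c keeps the KV cache small, and --no-kv-offload keeps KV in host memory.
./llama-cli -m qwen3.6-35b-a3b-iq3_s.gguf -ngl 28 -c 8192 --no-kv-offload \
    -p "Sanity check: count from 1 to 10."
```

If output is clean here, grow `-c` and `-ngl` stepwise until the corruption reappears; that bisects the failure to a specific memory regime.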
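The KV-cache pressure mentioned above can be made concrete with back-of-envelope arithmetic. The sketch below uses an illustrative GQA config; the layer and head counts are hypothetical, not Qwen3.6's published architecture:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by the K and V tensors across all layers at a given
    context length (factor of 2 = one K plus one V tensor per layer);
    bytes_per_elem defaults to 2 for an fp16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical mid-size MoE config: 48 layers, 8 KV heads of dim 128.
# Compare the post's 204800-token context against a modest 8192 tokens.
big = kv_cache_bytes(48, 8, 128, 204_800)
small = kv_cache_bytes(48, 8, 128, 8_192)
print(f"ctx 204800: {big / 2**30:.1f} GiB")    # → ctx 204800: 37.5 GiB
print(f"ctx   8192: {small / 2**30:.1f} GiB")  # → ctx   8192: 1.5 GiB
```

Even with active parameters in the low billions, a 200k-token cache of this shape alone exceeds most consumer VRAM, which is exactly when unified-memory paging kicks in.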
// TAGS
qwen3.6-35b-a3b · llama.cpp · llm · gpu · inference · open-source

DISCOVERED

4h ago

2026-04-19

PUBLISHED

8h ago

2026-04-19

RELEVANCE

8/10

AUTHOR

FiniteElemente