OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE

Qwen3.6-35B-A3B turns to gibberish on RAM spill

A Reddit user reports that Qwen3.6-35B-A3B starts producing gibberish once llama.cpp spills memory from VRAM into system RAM under CUDA Unified Memory. The post asks whether the culprit is the build, the runtime flags, or a deeper unified-memory bug.

// ANALYSIS

This looks more like a llama.cpp memory-management edge case than a prompt or sampling problem. The combination of Unified Memory, aggressive fit settings, and long-context offload is exactly where backend bugs tend to surface.

  • Upstream llama.cpp has known reports that CUDA Unified Memory is flaky for models that spill beyond VRAM, especially under heavy decode/load pressure.
  • The posted config is very aggressive: `--fit`, `--fit-target 256`, `--fit-ctx 204800`, `--kv-offload`, and large batch settings all increase the chance of allocator or paging issues.
  • Qwen3.6 is a sparse MoE model, so total parameter count is not the whole story, but long-context KV cache pressure can still push a run into unstable memory behavior.
  • Recent llama.cpp chatter also suggests some IQ3_S/CUDA combinations have regressions, so changing quantization or CUDA/toolchain version is a plausible next test.
  • For developers serving large local models, this is a reminder that “fits in RAM” is not the same as “runs correctly once pages start migrating.”
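If the unified-memory path is the suspect, a conservative rerun that avoids it entirely is a quick way to isolate the failure. A minimal sketch, assuming the standard llama.cpp CLI; the model filename and the layer count are placeholders, not values from the post:

```shell
# Leave Unified Memory off so oversized allocations fail loudly instead
# of silently paging (llama.cpp only enables UM when this env var is set).
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY

# Pin an explicit layer split and a modest context instead of the post's
# aggressive --fit settings: -ngl fixes how many layers live in VRAM,
# -c keeps the KV cache small, and --no-kv-offload keeps KV in host memory.
./llama-cli -m qwen3.6-35b-a3b-iq3_s.gguf -ngl 28 -c 8192 --no-kv-offload \
    -p "Sanity check: count from 1 to 10."
```

If output is clean here, grow `-c` and `-ngl` stepwise until the corruption reappears; that bisects the failure to a specific memory regime.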
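The KV-cache pressure mentioned above can be made concrete with back-of-envelope arithmetic. The sketch below uses an illustrative GQA config; the layer and head counts are hypothetical, not Qwen3.6's published architecture:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by the K and V tensors across all layers at a given
    context length (factor of 2 = one K plus one V tensor per layer);
    bytes_per_elem defaults to 2 for an fp16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical mid-size MoE config: 48 layers, 8 KV heads of dim 128.
# Compare the post's 204800-token context against a modest 8192 tokens.
big = kv_cache_bytes(48, 8, 128, 204_800)
small = kv_cache_bytes(48, 8, 128, 8_192)
print(f"ctx 204800: {big / 2**30:.1f} GiB")    # → ctx 204800: 37.5 GiB
print(f"ctx   8192: {small / 2**30:.1f} GiB")  # → ctx   8192: 1.5 GiB
```

Even with active parameters in the low billions, a 200k-token cache of this shape alone exceeds most consumer VRAM, which is exactly when unified-memory paging kicks in.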
// TAGS
qwen3.6-35b-a3b · llama.cpp · llm · gpu · inference · open-source

DISCOVERED

4h ago

2026-04-19

PUBLISHED

8h ago

2026-04-19

RELEVANCE

8/10

AUTHOR

FiniteElemente