Qwen3.5-35B-A3B strains RTX 4090, RAM load expected
A LocalLLaMA user asks whether the memory footprint they see while running Qwen3.5-35B-A3B on an RTX 4090 is expected, and whether the model is also spilling into system RAM, given that it is a large Qwen checkpoint with a very large default context window.
Some RAM use is likely normal here. The A3B suffix suggests a sparse mixture-of-experts setup with only a few billion parameters active per token, so the active-path size is only part of the memory story: the full set of expert weights still has to live somewhere. The official model card shows a 262,144-token default context and serving recipes that assume tensor parallelism across 8 GPUs, which is a strong signal that single-card runs are memory-constrained. If the backend is offloading weights or KV cache to the host, system RAM use is expected rather than suspicious. For newcomers, the important knobs are quantization, context length, and backend choice, not just the GPU model. The post is a useful sanity check: a big model on a 4090 usually means compromises, not a bug.
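To see why context length dominates at this scale, a back-of-the-envelope KV-cache estimate helps. The sketch below uses the standard formula (2 tensors per layer, one each for K and V); the layer/head/dimension numbers are hypothetical GQA-style values chosen for illustration, not the real Qwen3.5-35B-A3B architecture.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: K and V each store n_kv_heads * head_dim
    values per layer per token (bytes_per_elem=2 assumes fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical config for illustration only (not the real model's numbers):
gib = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                     context_len=262_144) / 1024**3
print(f"{gib:.1f} GiB")  # → 48.0 GiB at the full 262k default context
```

Even with these modest assumed dimensions, the cache alone at the default context dwarfs a 4090's 24 GB, which is why backends fall back to shorter contexts, quantized KV caches, or host-RAM offload.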
DISCOVERED
2026-03-25
PUBLISHED
2026-03-25
AUTHOR
fernandollb