OPEN_SOURCE ↗
REDDIT · 21d ago · INFRASTRUCTURE
llama.cpp server freezes on larger Qwen quants
A Reddit user says `llama-server` can handle Qwen3.5-397B-A17B-IQ2_XXS on a 64GB VRAM + 96GB RAM box, but the 133GB IQ2_M build freezes the web UI. The setup uses Vulkan, `--no-mmap`, and a small context, so the failure looks like loader or allocator headroom rather than raw capacity alone.
// ANALYSIS
This smells like a loader-path ceiling disguised as a quant problem. `llama.cpp` can span RAM + VRAM, but `--no-mmap` and Vulkan offload make the practical headroom much tighter than the file-size math suggests.
- bartowski lists IQ2_XXS at 106.57GB and IQ2_M at 133.28GB, so the jump is large enough to cross a staging threshold even on a nominal 160GB pool once OS, driver, and allocator overhead are counted.
- `--no-mmap` removes the lazy paging path, so the host-RAM ceiling matters far more than the combined RAM + VRAM arithmetic implies.
- Context-size and KV-cache tweaks only affect the runtime cache, not the weight load that is already at the edge.
- AMD/Vulkan multi-GPU offload has recent allocation-failure reports on second-GPU paths, so the MI60 pair is a plausible weak spot here.
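The pool arithmetic behind the first two bullets can be sketched as a back-of-envelope check. The quant sizes come from the bartowski listing above; the OS and driver overhead figures are illustrative assumptions, not measured values, and the point is how much headroom shrinks between the two quants, not the exact numbers.

```python
# Back-of-envelope headroom check for the reported 64GB VRAM + 96GB RAM box.
# Overhead values below are assumptions for illustration only.
vram_gb = 64.0                       # 2x MI60 (per the post)
ram_gb = 96.0                        # host RAM
nominal_pool_gb = vram_gb + ram_gb   # 160 GB "on paper"

# Assumed overheads: OS + desktop, Vulkan driver/allocator staging buffers.
os_overhead_gb = 6.0
driver_overhead_gb = 8.0
usable_pool_gb = nominal_pool_gb - os_overhead_gb - driver_overhead_gb

iq2_xxs_gb = 106.57   # loads and runs per the report
iq2_m_gb = 133.28     # freezes the web UI

def headroom(weights_gb: float) -> float:
    """Space left for KV cache, compute buffers, and staging after weights."""
    return usable_pool_gb - weights_gb

print(f"IQ2_XXS headroom: {headroom(iq2_xxs_gb):.2f} GB")
print(f"IQ2_M   headroom: {headroom(iq2_m_gb):.2f} GB")
```

Even with generous assumptions, both quants "fit" the nominal pool, which is why this looks like a loader-path problem: the IQ2_M headroom collapses to roughly a tenth of the pool, and with `--no-mmap` forcing the CPU-resident share of the weights fully into host RAM, any staging copy or allocator fragmentation during load can exhaust it.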
// TAGS
llm · inference · gpu · self-hosted · open-source · llama-cpp
DISCOVERED
2026-03-22 (21d ago)
PUBLISHED
2026-03-22 (21d ago)
RELEVANCE
8/10
AUTHOR
Salaja