OPEN_SOURCE ↗
REDDIT · 21d ago · INFRASTRUCTURE
llama.cpp server freezes on larger Qwen quants
A Reddit user says `llama-server` can handle Qwen3.5-397B-A17B-IQ2_XXS on a 64GB VRAM + 96GB RAM box, but the 133GB IQ2_M build freezes the web UI. The setup uses Vulkan, `--no-mmap`, and a small context, so the failure looks like loader or allocator headroom rather than raw capacity alone.
// ANALYSIS
This smells like a loader-path ceiling disguised as a quant problem. `llama.cpp` can span RAM + VRAM, but `--no-mmap` and Vulkan offload make the practical headroom much tighter than the file-size math suggests.
- bartowski lists IQ2_XXS at 106.57GB and IQ2_M at 133.28GB, so the jump is large enough to cross a staging threshold even on a nominal 160GB pool once OS, driver, and allocator overhead are counted.
- `--no-mmap` removes the lazy paging path, so the host-RAM ceiling matters far more than the combined RAM + VRAM arithmetic implies.
- Context-size and KV-cache tweaks only affect the runtime cache, not the weight load that is already at the edge.
- AMD/Vulkan multi-GPU offload has recent allocation-failure reports on second-GPU paths, so the MI60 pair is a plausible weak spot here.
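The pool arithmetic behind the first two bullets can be sketched as a back-of-envelope check. The quant sizes come from the bartowski listing above; the OS and driver overhead figures are illustrative assumptions, not measured values, and the point is how much headroom shrinks between the two quants, not the exact numbers.

```python
# Back-of-envelope headroom check for the reported 64GB VRAM + 96GB RAM box.
# Overhead values below are assumptions for illustration only.
vram_gb = 64.0                       # 2x MI60 (per the post)
ram_gb = 96.0                        # host RAM
nominal_pool_gb = vram_gb + ram_gb   # 160 GB "on paper"

# Assumed overheads: OS + desktop, Vulkan driver/allocator staging buffers.
os_overhead_gb = 6.0
driver_overhead_gb = 8.0
usable_pool_gb = nominal_pool_gb - os_overhead_gb - driver_overhead_gb

iq2_xxs_gb = 106.57   # loads and runs per the report
iq2_m_gb = 133.28     # freezes the web UI

def headroom(weights_gb: float) -> float:
    """Space left for KV cache, compute buffers, and staging after weights."""
    return usable_pool_gb - weights_gb

print(f"IQ2_XXS headroom: {headroom(iq2_xxs_gb):.2f} GB")
print(f"IQ2_M   headroom: {headroom(iq2_m_gb):.2f} GB")
```

Even with generous assumptions, both quants "fit" the nominal pool, which is why this looks like a loader-path problem: the IQ2_M headroom collapses to roughly a tenth of the pool, and with `--no-mmap` forcing the CPU-resident share of the weights fully into host RAM, any staging copy or allocator fragmentation during load can exhaust it.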
// TAGS
llm · inference · gpu · self-hosted · open-source · llama-cpp
DISCOVERED
2026-03-22 (21d ago)
PUBLISHED
2026-03-22 (21d ago)
RELEVANCE
8/10
AUTHOR
Salaja