llama.cpp server freezes on larger Qwen quants

// 80d ago INFRASTRUCTURE

A Reddit user says `llama-server` can handle Qwen3.5-397B-A17B-IQ2_XXS on a 64GB VRAM + 96GB RAM box, but the 133GB IQ2_M build freezes the web UI. The setup uses Vulkan, `--no-mmap`, and a small context, so the failure looks like loader or allocator headroom rather than raw capacity alone.

// ANALYSIS

This smells like a loader-path ceiling disguised as a quant problem. `llama.cpp` can span RAM + VRAM, but `--no-mmap` and Vulkan offload make the practical headroom much tighter than the file-size math suggests.

  • bartowski lists IQ2_XXS at 106.57GB and IQ2_M at 133.28GB. That 26.71GB jump cuts the nominal headroom on the 160GB combined pool from about 53GB to about 27GB, a margin that OS, driver, and allocator overhead can plausibly consume during load.
  • `--no-mmap` removes the lazy paging path, so the host-RAM ceiling matters a lot more than the combined RAM + VRAM arithmetic implies.
  • Context-size and KV-cache tweaks only affect runtime cache, not the weight load that is already on the edge.
  • AMD/Vulkan multi-GPU offload has recent allocation-failure reports on second-GPU paths, so the MI60 pair is a plausible weak spot here.
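
The headroom arithmetic behind these points can be sketched in a few lines of Python. This is a rough illustration, not a model of how `llama.cpp` actually allocates memory: the sizes come from the post, while the function names, the optional `reserve_gb` parameter, and the assumption that VRAM can be filled entirely with weights are hypothetical simplifications.

```python
# Sizes in GB, taken from the post. Everything else here is an assumption
# made for illustration, not measured llama.cpp behavior.
VRAM_GB = 64.0
RAM_GB = 96.0
QUANTS_GB = {"IQ2_XXS": 106.57, "IQ2_M": 133.28}


def headroom_gb(model_gb, vram_gb=VRAM_GB, ram_gb=RAM_GB, reserve_gb=0.0):
    """Nominal RAM + VRAM left after the weights, minus an assumed reserve."""
    return vram_gb + ram_gb - model_gb - reserve_gb


def host_share_gb(model_gb, vram_gb=VRAM_GB):
    """Weights left for host RAM in the best case where VRAM holds only weights."""
    return max(0.0, model_gb - vram_gb)


for name, size in QUANTS_GB.items():
    print(
        f"{name}: headroom {headroom_gb(size):.2f} GB, "
        f"host share {host_share_gb(size):.2f} GB of {RAM_GB:.0f} GB RAM"
    )
```

Under these assumptions IQ2_M leaves roughly half the nominal headroom of IQ2_XXS, and at least 69.28GB of its weights must sit in the 96GB of host RAM. The real margin is smaller, since the KV cache and compute buffers also take VRAM and the OS takes RAM.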
// TAGS
llm, inference, gpu, self-hosted, open-source, llama-cpp

DISCOVERED

80d ago

2026-03-22

PUBLISHED

80d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

Salaja