OPEN_SOURCE
REDDIT // INFRASTRUCTURE
Qwen3.6 hangs on dual 5060 Ti
A LocalLLaMA user reports that Qwen3.6-27B runs fine at first, then the llama.cpp or Ollama backend becomes unresponsive after a few follow-up turns on a 2x RTX 5060 Ti 16GB rig. The likely culprit is not broken hardware but an overambitious serving setup: a `--ctx-size 100000` setting combined with growing chat history and the resulting KV-cache pressure.
// ANALYSIS
This reads like a serving/configuration problem, not a PC failure. A 27B model can absolutely run locally, but asking it to carry a 100k-token context on consumer GPUs pushes the backend into the zone where it looks frozen even while the OS stays responsive.
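To see why `--ctx-size 100000` alone can sink this rig, a back-of-envelope KV-cache estimate helps. The architecture numbers in the sketch below are illustrative assumptions for a ~27B dense GQA model, not Qwen3.6-27B's published config; substitute the real values from the model's config before trusting the total.

```python
# Back-of-envelope KV-cache sizing. NOTE: layer count, KV heads, and
# head dim are ASSUMED placeholder values for a ~27B dense GQA model,
# not Qwen3.6-27B's actual config.
n_layers   = 48       # assumed transformer layer count
n_kv_heads = 8        # assumed grouped-query KV heads
head_dim   = 128      # assumed per-head dimension
elem_bytes = 2        # fp16 cache entries; a q8_0 cache roughly halves this

# Each cached token stores one K and one V vector per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
ctx_tokens = 100_000  # the reported --ctx-size

kv_gib = bytes_per_token * ctx_tokens / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token -> {kv_gib:.1f} GiB KV cache")
# ~192 KiB/token -> ~18.3 GiB for the cache alone. A Q4 quant of a 27B
# model already needs roughly 15 GiB of weights, so a 100k context
# overruns two 16 GB cards before activations even enter the picture.
```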
- Qwen3.6-27B is a dense, long-context model, so the model itself is not the problem; the memory footprint of the active context is (see the estimate above).
- `llama.cpp` and Ollama both have to manage KV cache and prompt reprocessing, so each follow-up turn can cost more than the last as the conversation history grows.
- OpenWebUI plus search/RAG can silently inflate prompts, which makes the third or fourth turn much heavier than the first.
- Two GPUs do not automatically mean better stability or throughput; sharding adds its own overhead, especially when the model already fits poorly at the chosen context size.
- The practical fix is to cut context size aggressively, trim history, and test a simpler single-backend setup before assuming a hardware fault; see the two sketches after this list.
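One way to act on the "cut context, simplify the stack" advice is to drive `llama-server` directly, with no Ollama or OpenWebUI in the loop. A minimal relaunch sketch, assuming a recent llama.cpp build and a Q4 GGUF of the model; the filename and numbers are placeholders, and flag names should be confirmed against your build's `llama-server --help`:

```python
import subprocess

# Relaunch llama-server directly with a modest context to establish a
# stable baseline before scaling anything up.
cmd = [
    "llama-server",
    "-m", "qwen3.6-27b-q4_k_m.gguf",  # placeholder path to a Q4 GGUF
    "--ctx-size", "8192",             # far below 100000; raise only once stable
    "--n-gpu-layers", "99",           # offload all layers to the GPUs
    "--split-mode", "layer",          # split layers across the two 5060 Tis
]
subprocess.run(cmd, check=True)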
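For the "trim history" part, a client-side sliding window keeps follow-up turns from compounding no matter how long the chat grows in the UI. A minimal sketch assuming OpenAI-style message dicts; `trim_history` and the 4-characters-per-token heuristic are hypothetical helpers, not part of llama.cpp, Ollama, or OpenWebUI:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token for English text);
    # swap in a real tokenizer for accuracy.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int = 6_000) -> list[dict]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(rough_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):             # walk newest-first
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + kept[::-1]              # restore chronological order
```

Calling this before every request bounds what the backend has to re-process on each turn, which is exactly the cost that grows silently in a long chat.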
// TAGS
qwen · llm · long-context · inference · gpu · self-hosted · local-first
DISCOVERED
2026-05-01
PUBLISHED
2026-05-01
RELEVANCE
8/10
AUTHOR
Force88