OPEN_SOURCE
REDDIT // INFRASTRUCTURE
Qwen3.6 hangs on dual 5060 Ti
A LocalLLaMA user reports that Qwen3.6-27B runs fine at first, then the llama.cpp or Ollama backend becomes unresponsive after a few follow-up turns on a 2x RTX 5060 Ti 16GB rig. The likely culprit is not broken hardware but an overambitious serving setup: a `--ctx-size 100000` setting combined with growing chat history and the resulting KV-cache pressure.
// ANALYSIS
This reads like a serving/configuration problem, not a PC failure. A 27B model can absolutely run locally, but asking it to carry a 100k-token context on consumer GPUs pushes the backend into the zone where it looks frozen even while the OS stays responsive.
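To see why `--ctx-size 100000` alone can sink this rig, a back-of-envelope KV-cache estimate helps. The architecture numbers in the sketch below are illustrative assumptions for a ~27B dense GQA model, not Qwen3.6-27B's published config; substitute the real values from the model's config before trusting the total.

```python
# Back-of-envelope KV-cache sizing. NOTE: layer count, KV heads, and
# head dim are ASSUMED placeholder values for a ~27B dense GQA model,
# not Qwen3.6-27B's actual config.
n_layers   = 48       # assumed transformer layer count
n_kv_heads = 8        # assumed grouped-query KV heads
head_dim   = 128      # assumed per-head dimension
elem_bytes = 2        # fp16 cache entries; a q8_0 cache roughly halves this

# Each cached token stores one K and one V vector per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
ctx_tokens = 100_000  # the reported --ctx-size

kv_gib = bytes_per_token * ctx_tokens / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token -> {kv_gib:.1f} GiB KV cache")
# ~192 KiB/token -> ~18.3 GiB for the cache alone. A Q4 quant of a 27B
# model already needs roughly 15 GiB of weights, so a 100k context
# overruns two 16 GB cards before activations even enter the picture.
```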
- Qwen3.6-27B is a dense, long-context model, so the model itself is not the problem; the memory footprint of the active context is (see the estimate above).
- `llama.cpp` and Ollama both have to manage KV cache and prompt reprocessing, so each follow-up turn can cost more than the last as the conversation history grows.
- OpenWebUI plus search/RAG can silently inflate prompts, which makes the third or fourth turn much heavier than the first.
- Two GPUs do not automatically mean better stability or throughput; sharding adds its own overhead, especially when the model already fits poorly at the chosen context size.
- The practical fix is to cut context size aggressively, trim history, and test a simpler single-backend setup before assuming a hardware fault; see the two sketches after this list.
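One way to act on the "cut context, simplify the stack" advice is to drive `llama-server` directly, with no Ollama or OpenWebUI in the loop. A minimal relaunch sketch, assuming a recent llama.cpp build and a Q4 GGUF of the model; the filename and numbers are placeholders, and flag names should be confirmed against your build's `llama-server --help`:

```python
import subprocess

# Relaunch llama-server directly with a modest context to establish a
# stable baseline before scaling anything up.
cmd = [
    "llama-server",
    "-m", "qwen3.6-27b-q4_k_m.gguf",  # placeholder path to a Q4 GGUF
    "--ctx-size", "8192",             # far below 100000; raise only once stable
    "--n-gpu-layers", "99",           # offload all layers to the GPUs
    "--split-mode", "layer",          # split layers across the two 5060 Tis
]
subprocess.run(cmd, check=True)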
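For the "trim history" part, a client-side sliding window keeps follow-up turns from compounding no matter how long the chat grows in the UI. A minimal sketch assuming OpenAI-style message dicts; `trim_history` and the 4-characters-per-token heuristic are hypothetical helpers, not part of llama.cpp, Ollama, or OpenWebUI:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token for English text);
    # swap in a real tokenizer for accuracy.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int = 6_000) -> list[dict]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(rough_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):             # walk newest-first
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + kept[::-1]              # restore chronological order
```

Calling this before every request bounds what the backend has to re-process on each turn, which is exactly the cost that grows silently in a long chat.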
// TAGS
qwen · llm · long-context · inference · gpu · self-hosted · local-first
DISCOVERED
2026-05-01
PUBLISHED
2026-05-01
RELEVANCE
8/10
AUTHOR
Force88