Qwen3.5 Q3 Hits Long-Context Wall
A LocalLLaMA user reports Qwen3.5-122B-A10B in Q3_K_XL stays strong for coding until roughly 75-80K tokens, then degrades abruptly with hallucinations and confusion. The model itself supports 262K native context, so this looks more like a quantization-and-serving stability issue than a hard context-limit problem.
This reads like a real long-context cliff, not just normal “more tokens, slightly worse answers” drift. The model is still well below its advertised context ceiling, which points the finger at low-bit weights, prompt accumulation, and session management rather than raw window size alone.
- Qwen3.5-122B-A10B is a MoE model with 262,144 native context and official guidance to keep at least 128K for preserving thinking quality, so 75-80K should not be inherently dangerous
- The abrupt failure pattern is consistent with quantization stress under long-context retrieval, and the thread’s replies echo that lower quants can diverge from higher-precision runs over long sessions
- BF16 KV cache helps memory fidelity, but it does not fix weight-quantization loss in attention, routing, and token selection
- The current sampling stack is fairly sharp already; more aggressive penalties can make a model feel more erratic once context quality starts slipping
- The practical fix is the one the poster already found: compact early, keep a running summary, and if possible move to a sturdier quant or a denser model for very long coding sessions
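
The "compact early, keep a running summary" advice can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the poster's actual setup: the `Session` class, the 60K trigger (chosen only because it sits below the reported 75-80K cliff), the four-turn tail, and the 4-characters-per-token estimate are all hypothetical, and `summarize` stands in for whatever model call or manual step produces the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return len(text) // 4


@dataclass
class Session:
    # Hypothetical hook: (previous_summary, old_turns) -> new summary text.
    summarize: Callable[[str, List[str]], str]
    compact_at: int = 60_000   # assumption: trigger below the reported 75-80K cliff
    keep_recent: int = 4       # assumption: newest turns are kept verbatim
    summary: str = ""
    turns: List[str] = field(default_factory=list)

    def total_tokens(self) -> int:
        return estimate_tokens(self.summary) + sum(estimate_tokens(t) for t in self.turns)

    def add_turn(self, text: str) -> bool:
        """Append a turn and compact if over budget. Returns True if compaction ran."""
        self.turns.append(text)
        if self.total_tokens() < self.compact_at or len(self.turns) <= self.keep_recent:
            return False
        # Fold everything except the newest turns into the running summary.
        old = self.turns[:-self.keep_recent]
        self.turns = self.turns[-self.keep_recent:]
        self.summary = self.summarize(self.summary, old)
        return True

    def prompt(self) -> str:
        """Build the text to send: running summary first, then recent turns."""
        head = [f"Summary so far:\n{self.summary}"] if self.summary else []
        return "\n\n".join(head + self.turns)
```

The point of the design is that compaction fires on a token budget rather than waiting for visible degradation, so the prompt the model sees stays below the range where the poster observed the drop-off.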
Discovered: 2026-05-26
Published: 2026-05-26
Author: _TheWolfOfWalmart_