Qwen, Kimi, GLM test 5090 limits
OPEN_SOURCE ↗
REDDIT // 3d ago · TUTORIAL


A LocalLLaMA user asks how far an RTX 5090 (32GB VRAM) backed by 64GB of system RAM can stretch across modern open-weight models. The practical answer: 30B-class models are comfortable, 60B-class models are plausible with quantization, and 300B-class dense models are far beyond what one card can handle cleanly.

// ANALYSIS

Quantization helps a lot, but it does not change the basic math: once you move into 300B dense territory, 32GB of VRAM plus 64GB of system RAM runs out of headroom fast. The real nuance is that MoE models can look enormous on paper while having a much smaller active-parameter footprint at inference.
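The basic math is easy to sketch. A rough lower bound on weight memory is parameter count times bits per weight; real quantized checkpoints run higher once scales, group metadata, and unquantized layers are counted (which is why the 72B int4 figure below is ~48.9GB, not 36GB). A minimal estimate:

```python
# Back-of-envelope weight memory at different quantization levels.
# This counts raw weights only; quantization metadata, KV cache,
# and runtime overhead all add on top.

def weight_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB for `params_b` billion
    parameters stored at `bits_per_weight` bits each."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for params in (30, 72, 300):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits:>2}-bit: ~{weight_gb(params, bits):.0f} GB")
```

At 4 bits, 300B parameters already needs ~150GB for weights alone, which is the figure behind the "not realistic on one consumer GPU" claim below.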

  • Qwen’s official repo shows 72B int4 using about 48.9GB, which fits on 64GB only with limited room left for context, KV cache, and runtime overhead
  • 60B-class dense models are the sensible upper tier for this setup if you want decent speed and fewer OOM headaches
  • 300B dense models would need roughly 150GB just for 4-bit weights before cache and allocator overhead, so they are not realistic on one consumer GPU
  • MoE models like Kimi K2 are easier to misread: 1T total parameters sounds impossible, but 32B active parameters makes the runtime story much closer to a large 30B-class model
  • The bottleneck after weights is context length, so long prompts and long chats can eat the extra memory you thought quantization bought you
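The last point is worth quantifying. KV-cache size scales linearly with context length, and a standard estimate is 2 (K and V) x layers x KV heads x head dim x sequence length x bytes per element. The architecture numbers below are illustrative, roughly a 70B-class model with grouped-query attention, not any specific model's config:

```python
# Rough KV-cache footprint as context grows. Layer/head numbers are
# assumptions for a generic 70B-class GQA model, not a real config.

def kv_cache_gb(seq_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence of
    `seq_len` tokens at fp16/bf16 precision."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

Under these assumptions an 8K context costs under 3GB, but a 128K context costs over 40GB, so long chats really can consume the margin quantization freed up.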
// TAGS
qwen · kimi · glm · llm · quantization · gpu · inference

DISCOVERED

3d ago

2026-04-09

PUBLISHED

3d ago

2026-04-09

RELEVANCE

8/10

AUTHOR

Huge_Case4509