Qwen, Kimi, GLM test 5090 limits
OPEN_SOURCE ↗
REDDIT // 3d ago · TUTORIAL


A LocalLLaMA user asks how far an RTX 5090 (32GB VRAM) backed by 64GB of system RAM can stretch across modern open-weight models. The practical answer: 30B-class models are comfortable, 60B-class models are plausible with quantization, and 300B-class dense models are far beyond what one card can handle cleanly.

// ANALYSIS

Quantization helps a lot, but it does not change the basic math: once you move into 300B dense territory, 32GB of VRAM plus 64GB of system RAM runs out of headroom fast. The real nuance is that MoE models can look enormous on paper while having a much smaller active-parameter footprint at inference.
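The basic math is easy to sketch. A rough lower bound on weight memory is parameter count times bits per weight; real quantized checkpoints run higher once scales, group metadata, and unquantized layers are counted (which is why the 72B int4 figure below is ~48.9GB, not 36GB). A minimal estimate:

```python
# Back-of-envelope weight memory at different quantization levels.
# This counts raw weights only; quantization metadata, KV cache,
# and runtime overhead all add on top.

def weight_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB for `params_b` billion
    parameters stored at `bits_per_weight` bits each."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for params in (30, 72, 300):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits:>2}-bit: ~{weight_gb(params, bits):.0f} GB")
```

At 4 bits, 300B parameters already needs ~150GB for weights alone, which is the figure behind the "not realistic on one consumer GPU" claim below.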

  • Qwen’s official repo shows 72B int4 using about 48.9GB, which fits on 64GB only with limited room left for context, KV cache, and runtime overhead
  • 60B-class dense models are the sensible upper tier for this setup if you want decent speed and fewer OOM headaches
  • 300B dense models would need roughly 150GB just for 4-bit weights before cache and allocator overhead, so they are not realistic on one consumer GPU
  • MoE models like Kimi K2 are easier to misread: 1T total parameters sounds impossible, but 32B active parameters makes the runtime story much closer to a large 30B-class model
  • The bottleneck after weights is context length, so long prompts and long chats can eat the extra memory you thought quantization bought you
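The last point is worth quantifying. KV-cache size scales linearly with context length, and a standard estimate is 2 (K and V) x layers x KV heads x head dim x sequence length x bytes per element. The architecture numbers below are illustrative, roughly a 70B-class model with grouped-query attention, not any specific model's config:

```python
# Rough KV-cache footprint as context grows. Layer/head numbers are
# assumptions for a generic 70B-class GQA model, not a real config.

def kv_cache_gb(seq_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence of
    `seq_len` tokens at fp16/bf16 precision."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

Under these assumptions an 8K context costs under 3GB, but a 128K context costs over 40GB, so long chats really can consume the margin quantization freed up.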
// TAGS
qwen · kimi · glm · llm · quantization · gpu · inference

DISCOVERED

3d ago

2026-04-09

PUBLISHED

3d ago

2026-04-09

RELEVANCE

8/10

AUTHOR

Huge_Case4509