OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE
LocalLLaMA probes VRAM for 128K context
A LocalLLaMA post asks how to estimate VRAM required for long context windows separately from model weights, using a hypothetical Qwen 397B model at 128K context as the example. The core issue is that context length is dominated by KV-cache memory, not just parameter count and quantization.
// ANALYSIS
This is the right question for local inference, because long-context deployments usually fail on KV cache before they fail on raw model size.
- Model-weight math only gives the static footprint; long prompts add a second memory bill for KV cache that grows with context length
- KV-cache cost scales roughly linearly with sequence length, layer count, KV-head count and head dimension (the full hidden size only when the model does not use grouped-query attention), batch size, concurrency, and KV precision
- Runtime defaults matter more than many users expect, since stacks like llama.cpp can reserve large context buffers unless `-c` and related settings are set explicitly
- For giant models, 128K context can add tens or hundreds of gigabytes on top of weights, so "can I load the model?" and "can I serve the context?" are separate sizing problems
- The lack of a simple back-of-the-envelope rule is still a tooling gap for local LLM users, which is why VRAM calculators and server-specific memory docs keep coming up in community threads
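The scaling described above can be sketched as a small estimator. Note the architecture numbers here are illustrative assumptions for a large GQA model (the "Qwen 397B" in the post is hypothetical, so no published layer/head counts exist); the formula itself is the standard KV-cache accounting: two tensors (K and V) per layer, each of shape batch × KV-heads × sequence × head-dim.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size in bytes.

    2 accounts for the separate K and V tensors; dtype_bytes is 2 for
    FP16/BF16 KV cache, 1 for an 8-bit quantized cache.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed shape for a large GQA model: 94 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(layers=94, kv_heads=8, head_dim=128,
                      seq_len=128_000, batch=1, dtype_bytes=2)
print(f"{size / 2**30:.1f} GiB")  # ~45.9 GiB per sequence at FP16
```

Under these assumptions a single 128K-token sequence costs ~46 GiB of KV cache on top of the weights, which is why "tens of gigabytes" extra is a realistic figure and why serving concurrency multiplies the bill.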
// TAGS
qwen · llm · inference · gpu
DISCOVERED
32d ago
2026-03-10
PUBLISHED
36d ago
2026-03-06
RELEVANCE
6/10
AUTHOR
9r4n4y