OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE
LocalLLaMA probes VRAM for 128K context
A LocalLLaMA post asks how to estimate VRAM required for long context windows separately from model weights, using a hypothetical Qwen 397B model at 128K context as the example. The core issue is that context length is dominated by KV-cache memory, not just parameter count and quantization.
// ANALYSIS
This is the right question for local inference, because long-context deployments usually fail on KV cache before they fail on raw model size.
- Model-weight math only gives the static footprint; long prompts add a second memory bill for KV cache that grows with context length
- KV-cache cost scales roughly linearly with sequence length, layer count, KV-head count and head dimension (the full hidden size only when the model does not use grouped-query attention), batch size, concurrency, and KV precision
- Runtime defaults matter more than many users expect, since stacks like llama.cpp can reserve large context buffers unless `-c` and related settings are set explicitly
- For giant models, 128K context can add tens or hundreds of gigabytes on top of weights, so "can I load the model?" and "can I serve the context?" are separate sizing problems
- The lack of a simple back-of-the-envelope rule is still a tooling gap for local LLM users, which is why VRAM calculators and server-specific memory docs keep coming up in community threads
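The scaling described above can be sketched as a small estimator. Note the architecture numbers here are illustrative assumptions for a large GQA model (the "Qwen 397B" in the post is hypothetical, so no published layer/head counts exist); the formula itself is the standard KV-cache accounting: two tensors (K and V) per layer, each of shape batch × KV-heads × sequence × head-dim.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size in bytes.

    2 accounts for the separate K and V tensors; dtype_bytes is 2 for
    FP16/BF16 KV cache, 1 for an 8-bit quantized cache.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed shape for a large GQA model: 94 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(layers=94, kv_heads=8, head_dim=128,
                      seq_len=128_000, batch=1, dtype_bytes=2)
print(f"{size / 2**30:.1f} GiB")  # ~45.9 GiB per sequence at FP16
```

Under these assumptions a single 128K-token sequence costs ~46 GiB of KV cache on top of the weights, which is why "tens of gigabytes" extra is a realistic figure and why serving concurrency multiplies the bill.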
// TAGS
qwen · llm · inference · gpu
DISCOVERED
32d ago
2026-03-10
PUBLISHED
36d ago
2026-03-06
RELEVANCE
6/10
AUTHOR
9r4n4y