OPEN_SOURCE
REDDIT · INFRASTRUCTURE · 5h ago
Ollama users hit Devstral VRAM gap
A LocalLLaMA thread compares Ollama’s reported VRAM use for Gemma 4 26B and Devstral Small 2 at 262K context, with Devstral initially showing more than 80GB despite similar Q4 model size. The likely culprit is not raw weights but long-context runtime memory, especially KV cache behavior, Flash Attention settings, and dense versus MoE architecture.
// ANALYSIS
This is a useful reminder that “24B Q4” is not a memory budget; context length and attention implementation can dominate local inference costs.
- Devstral Small 2 is a dense 24B coding model with 256K context, so large-context serving can allocate a huge KV cache even when the weights are quantized.
- Gemma 4 26B is an MoE model with about 4B active parameters per token, which changes runtime characteristics versus a dense model of similar nominal size.
- The report that re-enabling Flash Attention shrank the gap fits the diagnosis: attention/KV handling, not just model file size, is driving the surprise.
- For local agent workflows, operators should treat `context_length`, KV cache precision, Flash Attention, and model architecture as first-class deployment parameters.
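The KV-cache point above can be made concrete with back-of-envelope math. The sketch below uses the standard KV cache size formula (2 tensors per layer, one K and one V, each storing one vector per KV head per token); the layer count, KV head count, and head dimension are illustrative placeholders, not Devstral's published dimensions.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int) -> int:
    """Total KV cache size: K and V each hold ctx_len vectors
    of head_dim floats per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dense-model shape (40 layers, 8 GQA KV heads, head_dim 128)
# at the thread's 262K context with fp16 (2-byte) cache entries.
gib = kv_cache_bytes(40, 8, 128, 262_144, 2) / 2**30
print(f"{gib:.1f} GiB")  # 40.0 GiB -- cache alone rivals the quantized weights
```

Even under these modest assumed dimensions, the cache at full context is tens of gigabytes, which is why a "24B Q4" model can report >80GB. Ollama exposes the relevant knobs: `OLLAMA_FLASH_ATTENTION=1` and `OLLAMA_KV_CACHE_TYPE=q8_0` (or `q4_0`) trade cache precision for memory, roughly halving or quartering the fp16 figure.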
// TAGS
ollama · devstral-small-2 · gemma-4 · inference · gpu · self-hosted · llm
DISCOVERED
2026-04-22
PUBLISHED
2026-04-21
RELEVANCE
7/10
AUTHOR
malcolm-maya