Ollama users hit Devstral VRAM gap
OPEN_SOURCE ↗
REDDIT // 5h ago · INFRASTRUCTURE


A LocalLLaMA thread compares Ollama’s reported VRAM use for Gemma 4 26B and Devstral Small 2 at 262K context, with Devstral initially reported at more than 80GB despite a similar Q4 file size. The likely culprit is not raw weights but long-context runtime memory, especially KV cache behavior, Flash Attention settings, and dense versus MoE architecture.

// ANALYSIS

This is a useful reminder that “24B Q4” is not a memory budget; context length and attention implementation can dominate local inference costs.
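As a rough illustration of why context length can dominate, the KV cache of a dense transformer grows linearly with context and can rival or exceed the quantized weights themselves. The architecture numbers below are assumed for a generic dense ~24B model with grouped-query attention, not published Devstral figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed shape for a generic dense ~24B GQA model (illustrative only)
est = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, ctx_len=262_144)
print(f"{est / 2**30:.1f} GiB")  # -> 40.0 GiB of KV cache before weights load
```

Halving the context window halves this figure, which is why a 262K default can make a "24B Q4" model look far larger than its file size suggests.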

  • Devstral Small 2 is a dense 24B coding model with 256K context, so large-context serving can allocate a huge KV cache even when weights are quantized.
  • Gemma 4 26B is an MoE model with about 4B active parameters per token, which changes runtime characteristics versus a dense model of similar nominal size.
  • That re-enabling Flash Attention reduced the gap supports the diagnosis: attention/KV handling, not model file size alone, is driving the surprise.
  • For local agent workflows, operators should treat `context_length`, KV cache precision, Flash Attention, and model architecture as first-class deployment parameters.
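A minimal deployment sketch along those lines, assuming Ollama's documented environment variables and Modelfile syntax; the model tag `devstral-small-2` and the specific values are illustrative, not a recommendation from the thread:

```shell
# Enable Flash Attention and quantize the KV cache before serving.
# OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE are real Ollama env vars;
# q8_0 KV cache requires Flash Attention to be on.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve &

# Cap the context window well below the model's 256K maximum via a Modelfile
# (num_ctx is the standard parameter; the base tag here is hypothetical):
cat > Modelfile <<'EOF'
FROM devstral-small-2
PARAMETER num_ctx 32768
EOF
ollama create devstral-32k -f Modelfile
ollama run devstral-32k
```

Treating these as explicit deployment parameters, rather than accepting defaults, is what closes the gap between expected and observed VRAM.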
// TAGS
ollama · devstral-small-2 · gemma-4 · inference · gpu · self-hosted · llm

DISCOVERED

5h ago

2026-04-22

PUBLISHED

5h ago

2026-04-21

RELEVANCE

7 / 10

AUTHOR

malcolm-maya