Ollama users hit Devstral VRAM gap

// 45d ago | INFRASTRUCTURE

A LocalLLaMA thread compares Ollama’s reported VRAM use for Gemma 4 26B and Devstral Small 2 at 262K context, with Devstral initially showing more than 80 GB despite a similar Q4 model size. The likely culprit is not the raw weights but long-context runtime memory: KV cache behavior, Flash Attention settings, and dense versus MoE architecture.

// ANALYSIS

This is a useful reminder that “24B Q4” is not a memory budget; context length and attention implementation can dominate local inference costs.

  • Devstral Small 2 is a dense 24B coding model with a 256K (262,144-token) context window, so serving at full context can allocate a huge KV cache even when the weights are quantized.
  • Gemma 4 26B is an MoE model with about 4B active parameters per token, which changes runtime characteristics versus a dense model of similar nominal size.
  • Re-enabling Flash Attention reduced the gap, which is consistent with the diagnosis: attention and KV cache handling, not just model file size, is driving the surprise.
  • For local agent workflows, operators should treat `context_length`, KV cache precision, Flash Attention, and model architecture as first-class deployment parameters.
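
To make the KV cache point concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the standard per-sequence estimate for grouped-query attention (keys plus values, per layer, per KV head, per token). The layer count, KV head count, and head dimension below are assumed values for a hypothetical dense model, not confirmed figures for Devstral Small 2 or Gemma 4, and the estimate ignores sliding-window layers, runtime buffers, and quantization block overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element=2.0):
    """Approximate KV cache size in bytes for a single sequence."""
    # The factor of 2 covers one key tensor and one value tensor per layer.
    return int(2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element)


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# ASSUMPTION: illustrative shape for a hypothetical dense model,
# not the published configuration of any model named in the article.
ASSUMED_SHAPE = {"n_layers": 40, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for context_len in (8_192, 32_768, 262_144):
        # 2.0 bytes per element is f16; 1.0 approximates an 8-bit cache.
        for label, bytes_per_element in (("f16", 2.0), ("8-bit", 1.0)):
            size = kv_cache_bytes(
                context_len=context_len,
                bytes_per_element=bytes_per_element,
                **ASSUMED_SHAPE,
            )
            print(f"context={context_len:>7}  cache={label:<5}  ~{to_gib(size):.2f} GiB")
```

Under these assumed numbers the cache grows linearly with context length, so a setting that is negligible at 8K tokens can exceed the quantized weights at 262,144 tokens. Halving the bytes per cached element halves the estimate, which is why KV cache precision belongs in the same budget as the model file.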
// TAGS
ollama, devstral-small-2, gemma-4, inference, gpu, self-hosted, llm

DISCOVERED

45d ago

2026-04-22

PUBLISHED

45d ago

2026-04-21

RELEVANCE

7/10

AUTHOR

malcolm-maya