Ollama users hit Devstral VRAM gap

// 90d ago INFRASTRUCTURE

A LocalLLaMA thread compares Ollama’s reported VRAM use for Gemma 4 26B and Devstral Small 2 at 262K context, with Devstral initially showing more than 80 GB despite a similar Q4 model size. The likely culprit is not the raw weights but long-context runtime memory: KV cache behavior, Flash Attention settings, and dense versus MoE architecture.

// ANALYSIS

This is a useful reminder that “24B Q4” is not a memory budget; context length and attention implementation can dominate local inference costs.

– Devstral Small 2 is a dense 24B coding model with a 256K-token (262,144) context window, so serving at full context can allocate a huge KV cache even when the weights are quantized.
– Gemma 4 26B is an MoE model with about 4B active parameters per token, which gives it a different runtime profile from a dense model of similar nominal size.
– Re-enabling Flash Attention reduced the gap, which fits the diagnosis: attention and KV cache handling, not just model file size, is driving the surprise.
– For local agent workflows, operators should treat `context_length`, KV cache precision, Flash Attention, and model architecture as first-class deployment parameters.
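
To make the KV cache point concrete, here is a rough back-of-envelope sketch in Python. It uses the standard estimate of one key and one value vector per layer, per KV head, per token. The layer count, KV head count, and head dimension are hypothetical placeholders for a dense model of this class, not confirmed Devstral Small 2 values, and the quantized byte widths are assumptions based on ggml's block layouts.

```python
# Approximate bytes per cached element for a few cache precisions.
# f16 is exact; q8_0 and q4_0 assume ggml-style blocks (34 and 18 bytes
# per 32 elements) and are illustrative, not measured from Ollama.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element):
    """Estimate a full KV cache: keys plus values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical model shape (assumption, NOT confirmed Devstral Small 2 values).
ASSUMED_SHAPE = dict(n_layers=40, n_kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 262_144  # 256K tokens

if __name__ == "__main__":
    for name, width in BYTES_PER_ELEMENT.items():
        size = kv_cache_bytes(
            context_tokens=CONTEXT_TOKENS, bytes_per_element=width, **ASSUMED_SHAPE
        )
        print(f"{name}: {gib(size):.1f} GiB")
```

Under these assumed numbers, an f16 cache at full context comes to 40 GiB on its own, before weights and compute buffers. The estimate ignores sliding-window layers, parallel request slots, and any extra buffers the runtime allocates, so it is an order-of-magnitude check rather than a prediction of what Ollama reports.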

// TAGS

ollama, devstral-small-2, gemma-4, inference, gpu, self-hosted, llm

DISCOVERED

90d ago

2026-04-22

PUBLISHED

90d ago

2026-04-21

RELEVANCE

7/10

AUTHOR

malcolm-maya
