OPEN_SOURCE
REDDIT · 4d ago · TUTORIAL

Gemma 4 vision budgets confuse users

Unsloth’s Gemma 4 guide tells users to raise the visual token budget to 560 or 1120 for OCR, but the control lives in the model’s processor or serving layer, not in the prompt. That documentation gap is what makes the feature feel unclear, even though current runtimes do expose the knob.
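A minimal sketch of what setting that knob looks like on the Transformers side, assuming the `max_soft_tokens` kwarg and the fixed budget tiers the post describes; the model id in the comment is a placeholder, not a confirmed checkpoint name:

```python
# Visual token budgets the post says Gemma 4's image processor accepts.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def processor_kwargs(max_soft_tokens: int) -> dict:
    """Build the kwargs dict for the image processor.

    The budget is a fixed tier, not a free-form integer, so anything
    outside the supported set is rejected up front.
    """
    if max_soft_tokens not in SUPPORTED_BUDGETS:
        raise ValueError(
            f"max_soft_tokens must be one of {SUPPORTED_BUDGETS}, "
            f"got {max_soft_tokens}"
        )
    return {"max_soft_tokens": max_soft_tokens}


# Hypothetical usage (requires transformers and model weights):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "google/gemma-4", **processor_kwargs(1120)  # 1120 for OCR-heavy work
# )
```

The point of the wrapper is only to make the tiered nature of the budget explicit; in practice the kwarg would be passed straight to the processor.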

// ANALYSIS

The feature exists, but it’s split across stacks in a way that makes the model feel harder to use than it should.

  • In Transformers, Gemma 4’s image processor accepts `max_soft_tokens`, with supported budgets of 70, 140, 280, 560, and 1120.
  • In vLLM, the default is set at launch with `--mm-processor-kwargs '{"max_soft_tokens": 560}'`, and it can be overridden per request.
  • Higher budgets are the right choice for OCR, document parsing, handwriting, and small text; lower budgets favor captioning, screenshots, and faster multimodal work.
  • The Reddit post is really surfacing a documentation problem: the advice is there, but the configuration path is buried under model-, processor-, and server-specific APIs.
  • Audio is a separate constraint path, not the same budget knob: audio input is only supported on the E2B/E4B variants and clips are capped at 30 seconds.
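The task-to-budget guidance above can be sketched as a small helper. The mapping is illustrative, not from any official docs; 560 mirrors the default quoted in the post’s vLLM launch command, and the vLLM usage in the comment is a hypothetical sketch, not a verified API call:

```python
# Illustrative task -> budget mapping, following the guidance in the post:
# high budgets for dense text, low budgets for coarse/fast multimodal work.
TASK_BUDGETS = {
    "ocr": 1120,         # small text, handwriting
    "document": 1120,    # document parsing
    "captioning": 280,   # coarse understanding, speed matters
    "screenshot": 280,
}


def pick_budget(task: str, default: int = 560) -> int:
    """Return a visual token budget for a task.

    Falls back to 560, the serving-layer default in the quoted
    vLLM launch command.
    """
    return TASK_BUDGETS.get(task, default)


# Hypothetical per-launch override in vLLM's offline API
# (requires vllm and the model weights):
# from vllm import LLM
# llm = LLM(model="google/gemma-4",
#           mm_processor_kwargs={"max_soft_tokens": pick_budget("ocr")})
```

The same default can be set server-side at launch with `--mm-processor-kwargs '{"max_soft_tokens": 560}'`, per the post.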
// TAGS
gemma-4 · llm · multimodal · inference · prompt-engineering · open-source

DISCOVERED


2026-04-07

PUBLISHED


2026-04-07

RELEVANCE

9/10

AUTHOR

Oatilis