OPEN_SOURCE
REDDIT · 4d ago · TUTORIAL
Gemma 4 vision budgets confuse users
Unsloth’s Gemma 4 guide tells users to raise the visual token budget to 560 or 1120 for OCR, but the control itself lives in the model’s processor or the serving layer, not in the prompt. That documentation gap is what makes the feature feel unclear, even though current runtimes do expose the knob.
// ANALYSIS
The feature exists, but it’s split across stacks in a way that makes the model feel harder to use than it should.
- In Transformers, Gemma 4’s image processor accepts `max_soft_tokens`, with supported budgets of 70, 140, 280, 560, and 1120.
- In vLLM, the default is set at launch with `--mm-processor-kwargs '{"max_soft_tokens": 560}'`, and it can be overridden per request.
- Higher budgets are the right choice for OCR, document parsing, handwriting, and small text; lower budgets favor captioning, screenshots, and faster multimodal work.
- The Reddit post is really surfacing a documentation problem: the advice exists, but the configuration path is buried across model-, processor-, and server-specific APIs.
- Audio is a separate constraint, not the same budget knob: audio input applies only to the E2B/E4B variants and is capped at 30 seconds.
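The advice above collapses into a small policy: pick the budget from the task, validate it against the supported set, then pass it through whichever layer you are on. A minimal sketch; the supported values and the vLLM flag come from the post, while the task-to-budget mapping and the processor call shown in comments are illustrative assumptions, not a documented API:

```python
# Supported visual token budgets for Gemma 4's image processor, per the guide.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

# Illustrative task -> budget policy following the advice above: high budgets
# for dense text, lower budgets for captioning and screenshots (assumed values).
TASK_BUDGETS = {
    "ocr": 1120,
    "document": 1120,
    "handwriting": 1120,
    "screenshot": 280,
    "captioning": 140,
}

def visual_budget(task: str, default: int = 560) -> int:
    """Return a supported max_soft_tokens value for a given task."""
    budget = TASK_BUDGETS.get(task, default)
    if budget not in SUPPORTED_BUDGETS:
        raise ValueError(f"unsupported budget {budget}; choose from {SUPPORTED_BUDGETS}")
    return budget

# How the value is passed through each stack (call shapes are illustrative):
#
# Transformers -- kwarg on the image processor:
#   inputs = processor(images=img, text=prompt,
#                      images_kwargs={"max_soft_tokens": visual_budget("ocr")})
#
# vLLM -- default set at launch, overridable per request:
#   vllm serve <model> --mm-processor-kwargs '{"max_soft_tokens": 560}'

print(visual_budget("ocr"))         # 1120
print(visual_budget("captioning"))  # 140
```

The point of the helper is exactly the gap the post complains about: the number the docs tell you to use has to survive a trip through a stack-specific API before it reaches the model.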
// TAGS
gemma-4 · llm · multimodal · inference · prompt-engineering · open-source
DISCOVERED
4d ago
2026-04-07
PUBLISHED
5d ago
2026-04-07
RELEVANCE
9/10
AUTHOR
Oatilis