OPEN_SOURCE
REDDIT · 4d ago · TUTORIAL

Gemma 4 vision budgets confuse users

Unsloth’s Gemma 4 guide tells users to raise the visual token budget to 560 or 1120 for OCR, but the control lives in the model’s processor or serving layer, not in the prompt. That documentation gap is what makes the feature feel unclear, even though current runtimes do expose the knob.
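A minimal sketch of what setting that knob looks like on the Transformers side, assuming the `max_soft_tokens` kwarg and the fixed budget tiers the post describes; the model id in the comment is a placeholder, not a confirmed checkpoint name:

```python
# Visual token budgets the post says Gemma 4's image processor accepts.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def processor_kwargs(max_soft_tokens: int) -> dict:
    """Build the kwargs dict for the image processor.

    The budget is a fixed tier, not a free-form integer, so anything
    outside the supported set is rejected up front.
    """
    if max_soft_tokens not in SUPPORTED_BUDGETS:
        raise ValueError(
            f"max_soft_tokens must be one of {SUPPORTED_BUDGETS}, "
            f"got {max_soft_tokens}"
        )
    return {"max_soft_tokens": max_soft_tokens}


# Hypothetical usage (requires transformers and model weights):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "google/gemma-4", **processor_kwargs(1120)  # 1120 for OCR-heavy work
# )
```

The point of the wrapper is only to make the tiered nature of the budget explicit; in practice the kwarg would be passed straight to the processor.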

// ANALYSIS

The feature exists, but it’s split across stacks in a way that makes the model feel harder to use than it should.

  • In Transformers, Gemma 4’s image processor accepts `max_soft_tokens`, with supported budgets of 70, 140, 280, 560, and 1120.
  • In vLLM, the default is set at launch with `--mm-processor-kwargs '{"max_soft_tokens": 560}'`, and it can be overridden per request.
  • Higher budgets are the right choice for OCR, document parsing, handwriting, and small text; lower budgets favor captioning, screenshots, and faster multimodal work.
  • The Reddit post is really surfacing a documentation problem: the advice is there, but the configuration path is buried under model-, processor-, and server-specific APIs.
  • Audio is a separate constraint path, not the same budget knob: audio input is only supported on the E2B/E4B variants and clips are capped at 30 seconds.
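The task-to-budget guidance above can be sketched as a small helper. The mapping is illustrative, not from any official docs; 560 mirrors the default quoted in the post’s vLLM launch command, and the vLLM usage in the comment is a hypothetical sketch, not a verified API call:

```python
# Illustrative task -> budget mapping, following the guidance in the post:
# high budgets for dense text, low budgets for coarse/fast multimodal work.
TASK_BUDGETS = {
    "ocr": 1120,         # small text, handwriting
    "document": 1120,    # document parsing
    "captioning": 280,   # coarse understanding, speed matters
    "screenshot": 280,
}


def pick_budget(task: str, default: int = 560) -> int:
    """Return a visual token budget for a task.

    Falls back to 560, the serving-layer default in the quoted
    vLLM launch command.
    """
    return TASK_BUDGETS.get(task, default)


# Hypothetical per-launch override in vLLM's offline API
# (requires vllm and the model weights):
# from vllm import LLM
# llm = LLM(model="google/gemma-4",
#           mm_processor_kwargs={"max_soft_tokens": pick_budget("ocr")})
```

The same default can be set server-side at launch with `--mm-processor-kwargs '{"max_soft_tokens": 560}'`, per the post.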
// TAGS
gemma-4 · llm · multimodal · inference · prompt-engineering · open-source

DISCOVERED


2026-04-07

PUBLISHED


2026-04-07

RELEVANCE

9/10

AUTHOR

Oatilis