Gemma 4-E2B STT hits Home Assistant hurdles
Google's new 2B-parameter multimodal model, Gemma 4-E2B, is being repurposed for local Speech-to-Text (STT) in Home Assistant. Its accuracy is impressive, but by default it emits its internal "thought chain" along with the transcript, so community-developed middleware is needed to strip the reasoning tags and leave only the raw transcription.
Gemma 4's multimodal capabilities make it a high-performance local STT contender, but its "thoughtful" default behavior is currently a friction point for simple transcription tasks.
- Native audio support in a 2-billion-parameter model allows for low-latency, high-accuracy STT on consumer GPUs, rivaling dedicated models like Parakeet.
- The model’s built-in reasoning engine, while valuable for complex prompts, lacks a reliable server-side "off" switch in current llama.cpp and llama-swap implementations.
- Community members are working around the problem with custom FastAPI middleware that regex-strips <|channel>thought tags before the data reaches Home Assistant.
- This integration highlights the growing trend of using general-purpose multimodal LLMs as drop-in replacements for traditional specialized audio encoders.
- The combination of llama-swap and wyoming_openai remains the dominant architecture for bridging local LLM servers to the Home Assistant "Assist" pipeline.
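
The tag-stripping step described above can be sketched in a few lines of Python. This is only an illustration, not the community middleware itself: the opening marker `<|channel>thought` comes from the text, but the closing marker `<channel|>`, the function name `strip_thoughts`, and the choice to discard an unterminated reasoning block are all assumptions that would need checking against real model output. It uses only the standard library, so it could be called from a FastAPI proxy handler or any other wrapper.

```python
import re

# Opening marker as named in the text above.
THOUGHT_OPEN = "<|channel>thought"
# ASSUMPTION: the closing marker is a guess; adjust it to match real output.
THOUGHT_CLOSE = "<channel|>"

# Match from the opening marker up to the closing marker, or to the end of
# the string if the model never closed the reasoning block.
_THOUGHT_RE = re.compile(
    re.escape(THOUGHT_OPEN) + r".*?(?:" + re.escape(THOUGHT_CLOSE) + r"|\Z)",
    re.DOTALL,
)


def strip_thoughts(text: str) -> str:
    """Remove reasoning spans and return the trimmed transcription.

    Hypothetical helper: an unterminated reasoning block is dropped in full,
    which may also drop a transcript if the closing marker is wrong.
    """
    return _THOUGHT_RE.sub("", text).strip()


if __name__ == "__main__":
    raw = "<|channel>thought\nThe user spoke a command.<channel|>Turn on the kitchen lights."
    print(strip_thoughts(raw))
```

In a proxy setup, a function like this would be applied to the text field of the upstream transcription response before it is returned to the Home Assistant side, so the voice pipeline only ever sees the cleaned string.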
Discovered: 2026-04-18
Published: 2026-04-17
Author: andy2na