Gemma 4 E2B-IT trims RAM, hits 35 t/s
A Reddit benchmark claims a lean Ollama Modelfile cut Gemma 4 E2B-IT's CPU-only laptop footprint from 7.4 GB to about 2 GB on an i7-1165G7 with 16 GB RAM, while lifting easy prompts into the mid-30s tokens per second. The tradeoff is sharp: shrinking the context window and suppressing reasoning mode improves latency, but logic-heavy prompts still regress.
This looks less like a model miracle and more like a strong reminder that default long-context settings can dominate local CPU memory use, especially on thin-and-light laptops. Gemma 4 E2B-IT may be practical on 16 GB machines, but only if you tune for the workload instead of assuming the stock config is the right operating point.
- The 128K default context is the likely memory hog; capping `num_ctx` to 2048 plausibly cuts KV-cache pressure far more than it changes weight memory
- The speedup is real for retrieval and extraction tasks, but the logic-puzzle failure shows the tuning is trading capability for responsiveness
- `num_thread` matters on mobile Intel CPUs, where oversubscribing threads can waste cycles on contention instead of generation
- The report is useful because it separates "can it run?" from "can it run well for my task?" on consumer hardware
- If others replicate it, the interesting question is whether the gain comes mainly from context reduction, cache quantization, or better CPU scheduling
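The tuning described above can be sketched as a minimal Ollama Modelfile. This is an illustrative reconstruction, not the poster's verbatim file: the base model tag and the exact parameter values are assumptions drawn from the report's description.

```
# Hypothetical Modelfile sketch of the reported tuning (tag and values assumed)
FROM gemma4:e2b-it

# Cap context at 2048 tokens: shrinks KV-cache allocation vs. the 128K default,
# which is the likely source of most of the claimed 7.4 GB -> ~2 GB reduction
PARAMETER num_ctx 2048

# Pin threads to the i7-1165G7's 4 physical cores; oversubscribing the 8
# hyperthreads can cost more in contention than it gains in throughput
PARAMETER num_thread 4
```

Built and run with `ollama create gemma4-lean -f Modelfile` and `ollama run gemma4-lean`. If cache quantization is part of the gain, Ollama exposes it separately via the `OLLAMA_KV_CACHE_TYPE` environment variable (e.g. `q8_0`) on the serving process, which would let a replicator isolate its contribution from the context cap.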
DISCOVERED 2026-04-07
PUBLISHED 2026-04-07
AUTHOR Apprehensive-Scale90