OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
Gemma 4 31B 3-bit MLX trims Mac RAM
This release is a mixed-precision MLX conversion of Google’s Gemma 4 31B instruction model, with 5-bit embeddings and 3-bit weights elsewhere, targeting Apple Silicon users who want to run a large text-only model in less RAM. The model card lists a ~13.8 GB output size, recommends standard sampling settings, and includes LM Studio reasoning-parsing instructions for “thinking” output.
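The ~13.8 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is an assumption-laden estimate, not the converter's actual math: the embedding/body parameter split, the group size of 64, and the 32 bits of per-group scale/bias metadata are all guesses modeled on typical MLX group quantization defaults.

```python
# Rough size check for a mixed 5-bit/3-bit quantization of a 31B model.
# All constants below are assumptions for illustration, not model-card facts.
def quantized_size_gb(params, bits, group_size=64, overhead_bits_per_group=32):
    """Approximate stored size: payload bits plus per-group scale/bias metadata."""
    groups = params / group_size
    total_bits = params * bits + groups * overhead_bits_per_group
    return total_bits / 8 / 1e9

embed_params = 2e9   # assumed share of parameters in the 5-bit embeddings
body_params = 29e9   # assumed remainder of the 31B parameters at 3-bit

total = quantized_size_gb(embed_params, 5) + quantized_size_gb(body_params, 3)
print(f"~{total:.1f} GB")
```

Under these assumptions the estimate lands in the same ballpark as the listed ~13.8 GB; the gap comes from the guessed parameter split and metadata overhead, not from the 3-bit payload itself.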
// ANALYSIS
Hot take: this is a practical niche quant, not a general-purpose win. If you want Gemma 4 on a constrained Mac and do not care about vision, the size/runtime tradeoff is the whole story.
- The quantization scheme is straightforward and legible: 5-bit embeddings plus 3-bit weights elsewhere.
- The author's positioning is clear: text-only local inference for RAM-poor Mac users, not a multimodal demo.
- The claimed ~13.8 GB footprint makes a 31B-class model reachable on 24 GB machines, but the real value depends on your runtime and context length.
- The LM Studio reasoning-template notes are useful operationally, since Gemma 4's thinking mode needs the right start/end markers.
- The "faster than other 3-bit MLX builds" claim is worth treating as a post-level benchmark claim unless you reproduce it yourself.
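On the reasoning-template point above: parsing "thinking" output comes down to splitting text between the model's start/end markers. The marker strings in this sketch are placeholders, not Gemma 4's actual tokens; substitute the exact markers from the model card's LM Studio instructions.

```python
import re

# Minimal reasoning-parsing sketch: separate "thinking" text from the answer.
# START/END are hypothetical placeholders, not the model's real markers.
START, END = "<start_of_thought>", "<end_of_thought>"

def split_thinking(text):
    """Return (thinking, answer); thinking is "" when no markers are present."""
    m = re.search(re.escape(START) + r"(.*?)" + re.escape(END), text, re.DOTALL)
    if not m:
        return "", text.strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return m.group(1).strip(), answer

thinking, answer = split_thinking(
    f"{START}User wants RAM numbers.{END}It fits in about 14 GB."
)
```

This is roughly what a runtime's reasoning parser does for you; configuring the wrong markers means the thinking text leaks into the visible answer, which is why the model card calls the setting out.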
// TAGS
gemma4 · mlx · quantization · apple-silicon · macos · local-llm · hugging-face · llm
DISCOVERED
3h ago
2026-04-28
PUBLISHED
5h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
JLeonsarmiento