OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
Gemma 4 31B 3-bit MLX trims Mac RAM
This release is a mixed-precision MLX conversion of Google’s Gemma 4 31B instruction model, with 5-bit embeddings and 3-bit weights elsewhere, targeting Apple Silicon users who want to run a large text-only model in less RAM. The model card lists a ~13.8 GB output size, recommends standard sampling settings, and includes LM Studio reasoning-parsing instructions for “thinking” output.
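The ~13.8 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is an assumption-laden estimate, not the converter's actual math: the embedding/body parameter split, the group size of 64, and the 32 bits of per-group scale/bias metadata are all guesses modeled on typical MLX group quantization defaults.

```python
# Rough size check for a mixed 5-bit/3-bit quantization of a 31B model.
# All constants below are assumptions for illustration, not model-card facts.
def quantized_size_gb(params, bits, group_size=64, overhead_bits_per_group=32):
    """Approximate stored size: payload bits plus per-group scale/bias metadata."""
    groups = params / group_size
    total_bits = params * bits + groups * overhead_bits_per_group
    return total_bits / 8 / 1e9

embed_params = 2e9   # assumed share of parameters in the 5-bit embeddings
body_params = 29e9   # assumed remainder of the 31B parameters at 3-bit

total = quantized_size_gb(embed_params, 5) + quantized_size_gb(body_params, 3)
print(f"~{total:.1f} GB")
```

Under these assumptions the estimate lands in the same ballpark as the listed ~13.8 GB; the gap comes from the guessed parameter split and metadata overhead, not from the 3-bit payload itself.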
// ANALYSIS
Hot take: this is a practical niche quant, not a general-purpose win. If you want Gemma 4 on a constrained Mac and do not care about vision, the size/runtime tradeoff is the whole story.
- The quantization scheme is straightforward and legible: 5-bit embeddings plus 3-bit weights elsewhere.
- The author's positioning is clear: text-only local inference for RAM-poor Mac users, not a multimodal demo.
- The claimed ~13.8 GB footprint makes a 31B-class model reachable on 24 GB machines, but the real value depends on your runtime and context length.
- The LM Studio reasoning-template notes are useful operationally, since Gemma 4's thinking mode needs the right start/end markers.
- The "faster than other 3-bit MLX builds" claim is worth treating as a post-level benchmark claim unless you reproduce it yourself.
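On the reasoning-template point above: parsing "thinking" output comes down to splitting text between the model's start/end markers. The marker strings in this sketch are placeholders, not Gemma 4's actual tokens; substitute the exact markers from the model card's LM Studio instructions.

```python
import re

# Minimal reasoning-parsing sketch: separate "thinking" text from the answer.
# START/END are hypothetical placeholders, not the model's real markers.
START, END = "<start_of_thought>", "<end_of_thought>"

def split_thinking(text):
    """Return (thinking, answer); thinking is "" when no markers are present."""
    m = re.search(re.escape(START) + r"(.*?)" + re.escape(END), text, re.DOTALL)
    if not m:
        return "", text.strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return m.group(1).strip(), answer

thinking, answer = split_thinking(
    f"{START}User wants RAM numbers.{END}It fits in about 14 GB."
)
```

This is roughly what a runtime's reasoning parser does for you; configuring the wrong markers means the thinking text leaks into the visible answer, which is why the model card calls the setting out.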
// TAGS
gemma4 · mlx · quantization · apple-silicon · macos · local-llm · hugging-face · llm
DISCOVERED
3h ago
2026-04-28
PUBLISHED
5h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
JLeonsarmiento