Gemma 4 31B 3-bit MLX trims Mac RAM
OPEN_SOURCE ↗
REDDIT // 3h ago · MODEL RELEASE


This release is a mixed-precision MLX conversion of Google’s Gemma 4 31B instruction model, with 5-bit embeddings and 3-bit weights elsewhere, targeting Apple Silicon users who want to run a large text-only model in less RAM. The model card lists a ~13.8 GB output size, recommends standard sampling settings, and includes LM Studio reasoning-parsing instructions for “thinking” output.
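The 5-bit/3-bit split described above maps naturally onto a per-layer quantization rule. A minimal sketch, assuming mlx-lm's `quant_predicate` hook for `mlx_lm.convert()` (which receives a layer path and returns that layer's quantization settings); the `embed_tokens` path match and the group size are illustrative assumptions, not details from the model card:

```python
# Sketch of a mixed-precision quant predicate for mlx_lm.convert():
# keep the embedding table at 5-bit, quantize everything else to 3-bit.

def mixed_precision_predicate(path, module, config):
    """Return per-layer quantization settings, keyed on the layer path."""
    if "embed_tokens" in path:  # embedding table: keep more precision
        return {"bits": 5, "group_size": 64}
    return {"bits": 3, "group_size": 64}  # all other weights: 3-bit

# Illustrative usage (downloads the model, so commented out):
# from mlx_lm import convert
# convert("<hf-repo-id>", mlx_path="gemma-3bit-mlx",
#         quantize=True, quant_predicate=mixed_precision_predicate)
```

The predicate keeps the scheme legible: one special case for embeddings, one default for the rest.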

// ANALYSIS

Hot take: this is a practical niche quant, not a general-purpose win. If you want Gemma 4 on a constrained Mac and do not care about vision, the size/runtime tradeoff is the whole story.

  • The quantization scheme is straightforward and legible: 5-bit embeddings plus 3-bit elsewhere.
  • The author’s positioning is clear: text-only local inference for RAM-poor Mac users, not a multimodal demo.
  • The claimed ~13.8 GB footprint makes the 31B-class model reachable on 24 GB machines, but the real headroom depends on your runtime's overhead and your context length, since the KV cache eats into what remains.
  • The LM Studio reasoning template notes are useful operationally, since Gemma 4’s thinking mode needs the right start/end markers.
  • The “faster than other 3-bit MLX builds” claim is worth treating as a post-level benchmark claim unless you reproduce it yourself.
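The ~13.8 GB figure cited above is easy to sanity-check with back-of-envelope math. A rough sketch, assuming ~31B total parameters with roughly 1B of them in embeddings (both assumptions, not model-card figures) and the per-group scale/bias overhead typical of group-size-64 quantization:

```python
# Rough on-disk size estimate for a mixed 5-bit/3-bit quantization of a
# ~31B-parameter model. Parameter split and overhead are assumptions.

def quantized_size_gb(total_params, embed_params, embed_bits=5,
                      other_bits=3, group_size=64, overhead_bits=32):
    """Estimate size in GB, amortizing per-group scale+bias over each weight."""
    other_params = total_params - embed_params
    per_param_overhead = overhead_bits / group_size  # 0.5 bits at group size 64
    bits = (embed_params * (embed_bits + per_param_overhead)
            + other_params * (other_bits + per_param_overhead))
    return bits / 8 / 1e9

# ~31B total, ~1B in embeddings:
print(f"{quantized_size_gb(31e9, 1e9):.1f} GB")  # → 13.8 GB
```

Under these assumptions the estimate lands right on the listed footprint, which is consistent with the post's claim that the quantization scheme, not some extra trick, accounts for the size.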
// TAGS
gemma4 · mlx · quantization · apple-silicon · macos · local-llm · hugging-face · llm

DISCOVERED

3h ago

2026-04-28

PUBLISHED

5h ago

2026-04-28

RELEVANCE

8/10

AUTHOR

JLeonsarmiento