JANGQ brings usable 2-bit MLX quantization to Apple Silicon
OPEN_SOURCE
REDDIT // 27d ago · OPEN-SOURCE RELEASE


JANGQ (Jang Adaptive N-bit Grading) is a new open-source mixed-precision quantization framework for Apple Silicon that makes ultra-low-bit MLX inference viable by protecting sensitive attention layers at higher precision while aggressively compressing bulk MLP parameters. Where native MLX uniform 2-bit quantization produces near-unusable output, JANGQ achieves 7/10 correctness at comparable bit widths — enabling 122B+ models to run usably on Macs with 128GB unified memory.
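The memory arithmetic behind this approach is easy to sanity-check. A minimal sketch, assuming an illustrative split of 2% of parameters (attention/output) at 8-bit and 98% (MLP/experts) at 3-bit — these exact fractions and bit widths are assumptions for illustration, not JANGQ's published configuration:

```python
def effective_bits(tiers):
    """Weighted-average bits per parameter across precision tiers.

    tiers: list of (fraction_of_params, bits) pairs; fractions sum to 1.
    """
    return sum(frac * bits for frac, bits in tiers)

# Assumed illustrative split: 2% sensitive layers at 8-bit, 98% bulk at 3-bit
mix = [(0.02, 8.0), (0.98, 3.0)]

bits = effective_bits(mix)          # 3.1 effective bits per parameter
gb = 122e9 * bits / 8 / 1e9         # ~47 GB for a 122B-parameter model

print(f"{bits:.2f} bits/param -> ~{gb:.0f} GB")
```

Protecting the 2% of sensitive weights at 8-bit adds only about 0.16 effective bits over a uniform 3-bit baseline, which is why the benchmark footprint lands in the same ballpark as the reported 46 GB.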

// ANALYSIS

JANGQ fills a gap that has quietly frustrated the Apple Silicon local-inference crowd: MLX's uniform quantization at 2-bit is so lossy it's been effectively unusable, leaving Mac users behind GGUF on llama.cpp in the ultra-low-bit regime. This is a direct fix.

  • The key insight is layer-sensitivity tiering: attention and output heads get 6-8 bits, MLP/expert layers get 2-3 bits — since MoE expert parameters can be 98% of total weights, protecting just the 2% attention budget costs almost nothing in memory
  • Benchmarks on M4 Max (128GB) show Qwen3.5-122B at 46 GB / 45 tok/s with JANG_1L, versus effectively broken output from MLX uniform 2-bit
  • Claims 25% memory savings vs. uniform 4-bit at comparable quality — 3.37-bit JANGQ outperforming uniform 4-bit on logit MSE
  • MLX Studio and vMLX (the companion inference front-end and engine) ship natively with JANGQ support; vMLX claims 224x faster long-context inference than LM Studio via a five-layer KV cache stack
  • Pre-quantized models already available on HuggingFace for Qwen3.5 family; conversion tooling installable via pip with one-line `jang convert` command
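The layer-sensitivity tiering described above can be sketched as a simple name-based bit assignment. This is a hypothetical predicate for illustration — the name patterns and bit widths are assumptions, not JANGQ's actual configuration or API:

```python
# Hypothetical tiering rule: keep sensitivity-critical layers at high
# precision, compress the bulk MLP/expert weights aggressively.
HIGH_PRECISION_PATTERNS = ("attn", "lm_head", "embed")  # assumed patterns

def bits_for(layer_name: str) -> int:
    """Return the quantization bit width for a layer, by name pattern."""
    if any(pat in layer_name for pat in HIGH_PRECISION_PATTERNS):
        return 8  # attention / output heads: protect at higher precision
    return 2      # MLP / expert layers: bulk of parameters, compress hard

print(bits_for("layers.0.self_attn.q_proj"))        # attention -> 8
print(bits_for("layers.0.mlp.experts.3.up_proj"))   # expert MLP -> 2
```

Because MoE expert weights dominate the parameter count, the 8-bit tier touches only a small slice of memory while preserving the layers where quantization error hurts output quality most.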
// TAGS
jangq · llm-inference · open-source · edge-ai · mlops

DISCOVERED

2026-03-16 (27d ago)

PUBLISHED

2026-03-16 (27d ago)

RELEVANCE

7/10

AUTHOR

HealthyCommunicat