JANGQ brings usable 2-bit MLX quantization to Apple Silicon
JANGQ (Jang Adaptive N-bit Grading) is a new open-source mixed-precision quantization framework for Apple Silicon that makes ultra-low-bit MLX inference viable by protecting sensitive attention layers at higher precision while aggressively compressing bulk MLP parameters. Where native MLX uniform 2-bit quantization produces near-unusable output, JANGQ achieves 7/10 correctness at comparable bit widths — enabling 122B+ models to run usably on Macs with 128GB unified memory.
JANGQ fills a gap that has quietly frustrated the Apple Silicon local-inference crowd: MLX's uniform quantization at 2-bit is so lossy it's been effectively unusable, leaving Mac users behind GGUF on llama.cpp in the ultra-low-bit regime. This is a direct fix.
- The key insight is layer-sensitivity tiering: attention and output heads get 6-8 bits, MLP/expert layers get 2-3 bits — since MoE expert parameters can be 98% of total weights, protecting the small attention budget costs almost nothing in memory
- Benchmarks on an M4 Max (128GB) show Qwen3.5-122B at 46 GB / 45 tok/s with JANG_1L, versus effectively broken output from MLX uniform 2-bit
- Claims 25% memory savings vs. uniform 4-bit at comparable quality, with 3.37-bit JANGQ outperforming uniform 4-bit on logit MSE
- MLX Studio and vMLX (the companion inference front-end and engine) ship natively with JANGQ support; vMLX claims 224x faster long-context inference than LM Studio via a five-layer KV cache stack
- Pre-quantized models are already available on HuggingFace for the Qwen3.5 family; conversion tooling is installable via pip with a one-line `jang convert` command
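The memory math behind the tiering claim is easy to check with a back-of-envelope sketch. The tier fractions and bit widths below are illustrative assumptions derived from the figures quoted above (2% attention budget, 2-3 bit experts, 122B parameters), not JANGQ's actual allocation policy:

```python
def effective_bits(tiers):
    """Average bits per weight for a list of (fraction_of_params, bits) tiers."""
    assert abs(sum(f for f, _ in tiers) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(f * b for f, b in tiers)

# Hypothetical MoE split: ~2% attention/output params held at 6-bit,
# ~98% expert MLP params squeezed to 3-bit.
tiers = [(0.02, 6), (0.98, 3)]
avg = effective_bits(tiers)
print(f"average {avg:.2f} bits/weight")  # 3.06 — close to the quoted 3.37-bit mix

# Weight memory for a 122B-parameter model at this average precision:
params = 122e9
gib = params * avg / 8 / 2**30
print(f"~{gib:.0f} GiB of weights")
```

Protecting the attention tier at 6-bit raises the average by only ~0.06 bits over a flat 3-bit scheme, which is why the quality/memory trade described above is so lopsided in tiering's favor.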
DISCOVERED: 2026-03-16
PUBLISHED: 2026-03-16
AUTHOR: HealthyCommunicat