JANGQ brings usable 2-bit MLX quantization to Apple Silicon

// 74d ago · OPENSOURCE RELEASE

JANGQ (Jang Adaptive N-bit Grading) is a new open-source mixed-precision quantization framework for Apple Silicon. It makes ultra-low-bit MLX inference viable by keeping sensitive attention layers at higher precision while aggressively compressing the bulk MLP parameters. Where native MLX uniform 2-bit quantization produces near-unusable output, JANGQ is reported to reach 7/10 correctness at comparable bit widths, which lets 122B+ models run usably on Macs with 128GB of unified memory.
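
The tiering idea can be illustrated as a rule that maps a layer name to a bit width. This is a minimal sketch and not JANGQ's actual implementation: the function name `assign_bits`, the substring patterns, the example layer names, and the specific widths (8 and 2) are all assumptions chosen for illustration.

```python
# Hypothetical sketch of sensitivity-tiered bit assignment (not JANGQ's real code).
# The patterns and bit widths below are assumptions made for illustration only.
SENSITIVE_PATTERNS = ("attn", "attention", "lm_head")  # assumed naming convention
HIGH_BITS = 8  # assumed width for sensitive attention/output layers
LOW_BITS = 2   # assumed width for bulk MLP/expert layers


def assign_bits(layer_name: str) -> int:
    """Return the bit width for a layer, chosen from its name."""
    name = layer_name.lower()
    if any(pattern in name for pattern in SENSITIVE_PATTERNS):
        return HIGH_BITS
    return LOW_BITS


# Example layer names follow a common Hugging Face style; they are illustrative.
layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.down_proj",
    "lm_head",
]
plan = {name: assign_bits(name) for name in layers}
```

A real quantizer would likely rank layers by measured sensitivity and not by name alone; name matching is used here only to keep the sketch short.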

// ANALYSIS

JANGQ fills a gap that has quietly frustrated the Apple Silicon local-inference crowd: MLX's uniform quantization at 2-bit is so lossy it's been effectively unusable, leaving Mac users behind GGUF on llama.cpp in the ultra-low-bit regime. This is a direct fix.

  • The key insight is layer-sensitivity tiering: attention and output heads get 6-8 bits, while MLP/expert layers get 2-3 bits. Because MoE expert parameters can make up 98% of total weights, keeping the remaining 2% (the attention budget) at higher precision costs almost nothing in memory
  • Benchmarks on M4 Max (128GB) show Qwen3.5-122B at 46 GB / 45 tok/s with JANG_1L, versus effectively broken output from MLX uniform 2-bit
  • Claims 25% memory savings vs. uniform 4-bit at comparable quality, with a 3.37-bit JANGQ profile reported to outperform uniform 4-bit on logit MSE
  • MLX Studio and vMLX (the companion inference front-end and engine) ship with native JANGQ support; vMLX claims 224x faster long-context inference than LM Studio via a five-layer KV cache stack
  • Pre-quantized models are already available on HuggingFace for the Qwen3.5 family; conversion tooling is installable via pip and exposes a one-line `jang convert` command
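
The "costs almost nothing" argument in the first bullet is a weighted average, which can be checked with a few lines of arithmetic. The sketch below assumes a 98% / 2% split with 2-bit and 8-bit tiers and a 122B-parameter model; the helper names are hypothetical. It counts raw weight bits only and ignores per-group scales, activations, and the KV cache, so it is not expected to reproduce the file sizes quoted above.

```python
def effective_bits(tiers):
    """Weighted-average bits per weight; tiers is a list of (fraction, bits)."""
    total_fraction = sum(fraction for fraction, _ in tiers)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("tier fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in tiers)


def weight_memory_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (1e9 bytes), excluding scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 122e9  # assumed parameter count for a 122B model

# Assumed split: 98% of weights at 2 bits, 2% at 8 bits.
mixed_bits = effective_bits([(0.98, 2), (0.02, 8)])  # about 2.12 bits per weight
uniform_bits = effective_bits([(1.0, 2)])

# Extra memory spent on protecting the 2% tier, in GB.
overhead_gb = weight_memory_gb(N_PARAMS, mixed_bits) - weight_memory_gb(N_PARAMS, uniform_bits)
```

Under these assumptions, raising 2% of the weights from 2 to 8 bits adds roughly 0.12 bits per weight on average, or under 2 GB for a model of this size.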
// TAGS
jangq, llm, inference, open-source, edge-ai, mlops

DISCOVERED: 74d ago (2026-03-16)

PUBLISHED: 74d ago (2026-03-16)

RELEVANCE: 7/10

AUTHOR: HealthyCommunicat