mlx-kld benchmarks oQ, Q, MXFP, UD quants
mlx-kld is a KL-divergence benchmark for comparing MLX quantization schemes against a bf16 reference on real model outputs. The linked results suggest quantization quality is highly format- and architecture-dependent, which makes KLD a more useful lens than raw bit-width alone.
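For readers who want the metric spelled out, here is a minimal sketch of per-token KL divergence between a reference model's next-token distribution and a quantized model's. It is not taken from mlx-kld itself: the function names (`log_softmax`, `token_kld`, `mean_kld`) and the plain-list input format are assumptions made for illustration, and a real implementation would presumably work on MLX arrays rather than Python lists.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(ref || quant) for a single token position, in nats."""
    ref_lp = log_softmax(ref_logits)
    quant_lp = log_softmax(quant_logits)
    # Sum over the vocabulary of p_ref * (log p_ref - log p_quant).
    return sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, quant_lp))

def mean_kld(ref_seq, quant_seq):
    """Average per-token KLD over a sequence of positions (hypothetical helper)."""
    per_token = [token_kld(r, q) for r, q in zip(ref_seq, quant_seq)]
    return sum(per_token) / len(per_token)
```

A value of zero means the quantized model reproduces the reference distribution exactly at that position; larger values mean the quantizer has shifted probability mass away from where the bf16 model put it.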
Hot take: this is the right way to evaluate MLX quantization. Once you look at divergence from the reference distribution instead of just memory savings, it becomes obvious that “4-bit vs 6-bit” is too crude a shortcut.
- KLD is a cleaner signal than perplexity for quantization damage: it compares the quantized model's output distribution directly against the bf16 reference, so it isolates the effect of the quantizer itself.
- The useful takeaway is not a single winner, but that oQ, native Q, MXFP, and UD can trade places depending on the model and architecture.
- For dense models, higher-bit or better-targeted 6-bit schemes look like the safest default; for MoE models, router-sensitive tensors can dominate the outcome.
- The benchmark reinforces that MLX users should treat quantization as a per-model decision, not a one-size-fits-all setting.
- The project is also practical: caching reference log-probs makes this kind of comparison feasible on Apple Silicon instead of purely theoretical.
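The caching point can be illustrated with a small sketch: compute the reference log-probs once, store them on disk, and reuse them for every quantized variant. Everything here is an assumption for illustration, including the JSON file format, the SHA-256 cache key, and the `compute_fn` callback standing in for a bf16 forward pass; the source does not describe mlx-kld's actual cache layout.

```python
import hashlib
import json
from pathlib import Path

def cache_path(cache_dir, model_id, text):
    """Derive a stable file name from the reference model id and the input text."""
    key = hashlib.sha256(f"{model_id}\n{text}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.json"

def cached_log_probs(cache_dir, model_id, text, compute_fn):
    """Return reference log-probs, running compute_fn only on a cache miss.

    compute_fn is a hypothetical callable that runs the expensive reference
    forward pass and returns nested lists of floats.
    """
    path = cache_path(cache_dir, model_id, text)
    if path.exists():
        return json.loads(path.read_text())
    log_probs = compute_fn(text)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(log_probs))
    return log_probs
```

The design choice is that the expensive bf16 pass is paid once per input, while each additional quant format only costs its own, cheaper forward pass plus a file read.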
DISCOVERED
45d ago
2026-04-30
PUBLISHED
45d ago
2026-04-29
RELEVANCE
AUTHOR
dpswt