MLX hits NVFP4 for 4-bit Mac inference
REDDIT · 3d ago · OPEN SOURCE RELEASE


Apple's MLX framework now supports NVIDIA's 4-bit floating point format (NVFP4), bringing Blackwell-level quantization performance and accuracy to Apple Silicon via optimized Metal kernels and M5 hardware acceleration. This update enables high-performance local LLM inference with minimal precision loss compared to traditional 4-bit methods.

// ANALYSIS

MLX's NVFP4 implementation is a game-changer for local AI, narrowing the accuracy gap between 16-bit and 4-bit models.

  • Dual scaling factors — a scale per 16-value micro-block plus a tensor-level scale — provide significantly better precision than traditional 4-bit integer quantization.
  • M5 series chips feature native hardware acceleration for NVFP4, while M1-M4 devices see up to 7x speedups through MLX's optimized Metal kernels.
  • The format enables running 35B parameter models at over 70 TPS on M3 Max, making large, capable models fast enough for interactive use.
  • Ollama now defaults to MLX on Apple Silicon, so most local users get these gains automatically with the v0.19 update.
  • While it won't beat FP16 for raw accuracy, it dramatically lowers the "quantization tax" that has historically plagued 4-bit models.
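The dual-scale scheme in the first bullet can be sketched numerically. Below is a minimal NumPy round-trip, assuming the published NVFP4 layout (E2M1 4-bit values, an FP8 E4M3 scale per 16-value micro-block, and a global FP32 tensor scale); it illustrates the math only and is not MLX's Metal implementation:

```python
import numpy as np

# Representable magnitudes of E2M1, the 4-bit float format NVFP4 stores.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E4M3_MAX = 448.0  # largest finite value of the FP8 scale format

def nvfp4_roundtrip(x, block=16):
    """Quantize a flat float32 array to a toy NVFP4 and decode it back.

    Two-level scaling: one FP32 scale for the whole tensor, plus one
    scale per 16-value micro-block (stored as FP8 E4M3 in the real
    format; kept in float here for clarity).
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    # Level 1: tensor scale chosen so every block scale fits in E4M3 range.
    tensor_scale = max(np.abs(x).max() / (6.0 * E4M3_MAX), 1e-12)
    # Level 2: per-block scale mapping each block's max magnitude to 6.0.
    block_scale = np.abs(x).max(axis=1, keepdims=True) / (6.0 * tensor_scale)
    block_scale = np.maximum(block_scale, 1e-12)
    scaled = x / (block_scale * tensor_scale)          # now within [-6, 6]
    # Round to nearest: pick the closest E2M1 magnitude, reapply the sign.
    codes = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[codes]
    return (q * block_scale * tensor_scale).reshape(-1)
```

Because each 16-value micro-block carries its own scale, a single outlier only degrades precision inside its own block rather than across the whole tensor — the effect the analysis credits for the smaller "quantization tax" versus single-scale INT4.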
// TAGS
mlx · llm · inference · open-source · apple-silicon

DISCOVERED

2026-04-08 (3d ago)

PUBLISHED

2026-04-08 (3d ago)

RELEVANCE

8/10

AUTHOR

Sea-Emu2600