Llama.cpp enables MXFP4 on older NVIDIA GPUs
OPEN_SOURCE · INFRASTRUCTURE
REDDIT · 4h ago

A new update to llama.cpp enables support for MXFP4 (Microscaling Floating Point 4-bit) quantization on older NVIDIA architectures, removing the previous Blackwell-only restriction. Users are leveraging this to run high-performance sparse mixture-of-experts (MoE) models like Qwen 3.6 35B A3B on consumer cards such as the RTX 3080.
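In practice, running an MXFP4 model on a pre-Blackwell card needs no special flags; a hedged sketch using llama.cpp's standard CLI (the model filename here is hypothetical — any MXFP4-quantized GGUF works the same way):

```shell
# Hypothetical model path; substitute your own MXFP4-quantized GGUF.
# -ngl 99 offloads all layers to the GPU; on Ampere/Ada, llama.cpp
# automatically falls back to its DP4A-based dequantization kernels
# instead of requiring native Blackwell FP4 support.
./llama-cli -m qwen-moe-mxfp4.gguf -ngl 99 -p "Hello"
```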

// ANALYSIS

Software-level emulation of Blackwell-exclusive features democratizes high-efficiency inference for the massive installed base of older GPUs.

  • MXFP4 provides a superior perplexity-to-size ratio compared to standard 4-bit GGUF quants, making it a "sweet spot" for mid-sized models.
  • On Ampere and Ada cards, MXFP4 is handled by integer dot-product (DP4A) kernels that dequantize in software, rather than by native FP4 hardware acceleration, trading some inference speed for significantly better model quality.
  • The Qwen 3.6 35B A3B model's sparse architecture (activating only 3B parameters) makes it uniquely viable for 10GB-12GB VRAM cards when paired with this format.
  • While native Blackwell support yields a 25-33% speedup, the open-source community's ability to backport these formats ensures hardware longevity for hobbyists.
  • Developers should still benchmark against IQ4_XS, as the software overhead of MXFP4 on older cards can vary significantly depending on the specific kernel implementation.
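For context on what the kernels are decoding: per the OCP Microscaling (MX) specification, MXFP4 stores blocks of 32 FP4 (E2M1) values that share one power-of-two (E8M0) scale. A minimal Python sketch of block dequantization — illustrative only; llama.cpp's actual kernels are CUDA/C++:

```python
# MXFP4 block dequantization sketch (OCP Microscaling spec):
# 32 FP4 (E2M1) elements share a single E8M0 power-of-two scale.

# E2M1 magnitudes indexed by the low 3 bits of each 4-bit code;
# the high bit of the code is the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(scale_e8m0: int, codes: list[int]) -> list[float]:
    """Decode one 32-element MXFP4 block.

    scale_e8m0: 8-bit biased exponent (bias 127), so scale = 2**(e - 127).
    codes: 32 four-bit values (sign bit | 3-bit E2M1 index).
    """
    assert len(codes) == 32
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for c in codes:
        sign = -1.0 if c & 0x8 else 1.0
        out.append(sign * E2M1[c & 0x7] * scale)
    return out

# Example: scale exponent 127 -> scale 1.0; code 0x7 -> +6.0, 0xF -> -6.0
vals = dequant_mxfp4_block(127, [0x7, 0xF] + [0x0] * 30)
```

The single shared power-of-two scale is what makes the format cheap to decode on older GPUs: dequantization reduces to a table lookup plus a shift-like multiply, which maps well onto DP4A-style integer kernels.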
// TAGS
llama-cpp · gpu · inference · llm · open-source · mxfp4 · qwen

DISCOVERED

2026-04-21 (4h ago)

PUBLISHED

2026-04-21 (4h ago)

RELEVANCE

8/10

AUTHOR

autisticit