OPEN_SOURCE ↗
REDDIT // INFRASTRUCTURE
Llama.cpp enables MXFP4 on older NVIDIA GPUs
A new update to llama.cpp enables support for MXFP4 (Microscaling Floating Point 4-bit) quantization on older NVIDIA architectures, bypassing the previous requirement for Blackwell hardware. Users are leveraging this to run high-performance sparse MoE models like Qwen 3.6 35B A3B on consumer cards like the RTX 3080.
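For context, the OCP Microscaling spec defines MXFP4 as blocks of 32 FP4 (E2M1) elements sharing one 8-bit power-of-two (E8M0) scale. A minimal dequantization sketch, assuming the spec's E2M1 code table; this is illustrative and not llama.cpp's actual kernel:

```python
# Sketch of MXFP4 (OCP Microscaling FP4) dequantization.
# Assumes the spec's E2M1 element format: bit 3 = sign,
# bits 0-2 index the magnitude table below.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(scale_e8m0: int, nibbles: list[int]) -> list[float]:
    """Decode one 32-element MXFP4 block.

    scale_e8m0: shared 8-bit power-of-two exponent (bias 127).
    nibbles:    32 4-bit element codes.
    """
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for code in nibbles:
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * FP4_E2M1[code & 0x7] * scale)
    return out

# Scale exponent 128 -> scale 2.0; code 0b0101 -> +3.0 -> 6.0
print(dequant_mxfp4_block(128, [0b0101] * 32)[:4])
```

On pre-Blackwell cards this decode has to happen in software before (or fused into) the matmul, which is where the DP4A path and its overhead come from.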
// ANALYSIS
Software-level emulation of Blackwell-exclusive features democratizes high-efficiency inference for the massive installed base of older GPUs.
- MXFP4 provides a superior perplexity-to-size ratio compared to standard 4-bit GGUF quants, making it a "sweet spot" for mid-sized models.
- On Ampere and Ada cards, performance relies on software dequantization (DP4A) rather than hardware acceleration, trading some inference speed for significantly better model quality.
- The Qwen 3.6 35B A3B model's sparse architecture (activating only 3B parameters) makes it uniquely viable for 10GB-12GB VRAM cards when paired with this format.
- While native Blackwell support yields a 25-33% speedup, the open-source community's ability to backport these formats ensures hardware longevity for hobbyists.
- Developers should still benchmark against IQ4_XS, as the software overhead of MXFP4 on older cards can vary significantly depending on the specific kernel implementation.
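The VRAM claim above can be sanity-checked with back-of-envelope arithmetic. MXFP4 costs 4 bits per weight plus one 8-bit scale per 32-weight block (4.25 bits/weight); the figures below are illustrative estimates for weights only, ignoring KV cache and activations:

```python
# Rough MXFP4 weight-size estimate: 4 element bits plus one shared
# 8-bit E8M0 scale per 32-weight block = 4.25 bits per weight.
# Illustrative arithmetic only, not a measured footprint.
def mxfp4_weight_gib(n_params_billions: float) -> float:
    bits_per_weight = 4 + 8 / 32
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

total = mxfp4_weight_gib(35)   # all 35B parameters
active = mxfp4_weight_gib(3)   # ~3B activated per token
print(f"total weights: {total:.1f} GiB, active per token: {active:.1f} GiB")
```

The full 35B of weights (~17 GiB) exceed a 10GB-12GB card, so the sparse-MoE appeal is that only the ~1.5 GiB of activated expert weights must be touched per token, which makes partial CPU offload of the inactive experts tolerable.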
// TAGS
llama-cpp · gpu · inference · llm · open-source · mxfp4 · qwen
DISCOVERED
2026-04-21
PUBLISHED
2026-04-21
RELEVANCE
8/10
AUTHOR
autisticit