OPEN_SOURCE ↗
REDDIT // INFRASTRUCTURE
Llama.cpp enables MXFP4 on older NVIDIA GPUs
A new update to llama.cpp enables support for MXFP4 (Microscaling Floating Point 4-bit) quantization on older NVIDIA architectures, bypassing the previous requirement for Blackwell hardware. Users are leveraging this to run high-performance sparse MoE models like Qwen 3.6 35B A3B on consumer cards like the RTX 3080.
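For context, the OCP Microscaling spec defines MXFP4 as blocks of 32 FP4 (E2M1) elements sharing one 8-bit power-of-two (E8M0) scale. A minimal dequantization sketch, assuming the spec's E2M1 code table; this is illustrative and not llama.cpp's actual kernel:

```python
# Sketch of MXFP4 (OCP Microscaling FP4) dequantization.
# Assumes the spec's E2M1 element format: bit 3 = sign,
# bits 0-2 index the magnitude table below.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(scale_e8m0: int, nibbles: list[int]) -> list[float]:
    """Decode one 32-element MXFP4 block.

    scale_e8m0: shared 8-bit power-of-two exponent (bias 127).
    nibbles:    32 4-bit element codes.
    """
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for code in nibbles:
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * FP4_E2M1[code & 0x7] * scale)
    return out

# Scale exponent 128 -> scale 2.0; code 0b0101 -> +3.0 -> 6.0
print(dequant_mxfp4_block(128, [0b0101] * 32)[:4])
```

On pre-Blackwell cards this decode has to happen in software before (or fused into) the matmul, which is where the DP4A path and its overhead come from.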
// ANALYSIS
Software-level emulation of Blackwell-exclusive features democratizes high-efficiency inference for the massive installed base of older GPUs.
- MXFP4 provides a superior perplexity-to-size ratio compared to standard 4-bit GGUF quants, making it a "sweet spot" for mid-sized models.
- On Ampere and Ada cards, performance relies on software dequantization (DP4A) rather than hardware acceleration, trading some inference speed for significantly better model quality.
- The Qwen 3.6 35B A3B model's sparse architecture (activating only 3B parameters) makes it uniquely viable for 10GB-12GB VRAM cards when paired with this format.
- While native Blackwell support yields a 25-33% speedup, the open-source community's ability to backport these formats ensures hardware longevity for hobbyists.
- Developers should still benchmark against IQ4_XS, as the software overhead of MXFP4 on older cards can vary significantly depending on the specific kernel implementation.
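The VRAM claim above can be sanity-checked with back-of-envelope arithmetic. MXFP4 costs 4 bits per weight plus one 8-bit scale per 32-weight block (4.25 bits/weight); the figures below are illustrative estimates for weights only, ignoring KV cache and activations:

```python
# Rough MXFP4 weight-size estimate: 4 element bits plus one shared
# 8-bit E8M0 scale per 32-weight block = 4.25 bits per weight.
# Illustrative arithmetic only, not a measured footprint.
def mxfp4_weight_gib(n_params_billions: float) -> float:
    bits_per_weight = 4 + 8 / 32
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

total = mxfp4_weight_gib(35)   # all 35B parameters
active = mxfp4_weight_gib(3)   # ~3B activated per token
print(f"total weights: {total:.1f} GiB, active per token: {active:.1f} GiB")
```

The full 35B of weights (~17 GiB) exceed a 10GB-12GB card, so the sparse-MoE appeal is that only the ~1.5 GiB of activated expert weights must be touched per token, which makes partial CPU offload of the inactive experts tolerable.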
// TAGS
llama-cpp · gpu · inference · llm · open-source · mxfp4 · qwen
DISCOVERED
2026-04-21
PUBLISHED
2026-04-21
RELEVANCE
8/10
AUTHOR
autisticit