ExLlamaV3 adds DFlash quantization, kernels
ExLlamaV3 v0.0.34 landed on May 9 with DFlash model quantization, lower autotune overhead, and new Triton attention kernels aimed at Gemma 4. The project keeps sharpening its core promise: more throughput from consumer GPUs without giving up flexibility.
This is the kind of release that compounds. No single feature is flashy, but the combination of quantization support, kernel work, and stall fixes is exactly how inference stacks win on real workloads.
- DFlash now moves from draft-model optimization into the quantization pipeline, which should make the speed path practical for a wider range of deployments (a generic sketch of the draft-and-verify idea follows this list)
- Reducing autotune stalls matters because local inference libraries often burn time in setup and kernel selection, not just raw compute (see the Triton autotune sketch after this list)
- Gemma 4-specific Triton kernels show ExLlamaV3 is still chasing architecture-level wins instead of relying on generic CUDA shortcuts
- The release cadence from May 2 to May 9 signals an aggressively active maintainer loop, which is a real advantage in fast-moving open-source infrastructure
- The strongest story remains coding and agentic workloads, where earlier DFlash benchmarks showed the biggest throughput gains
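For context on the draft-model angle: below is a minimal, framework-free sketch of greedy speculative decoding, the general technique a draft-model optimization builds on. Everything in it (the `speculative_decode` helper, the toy predictors, the greedy accept rule) is illustrative and assumed; it is not ExLlamaV3's API, and a real implementation verifies all proposed tokens in one batched target forward pass rather than one call per position.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int = 16,
    k: int = 4,
) -> List[Token]:
    """Draft proposes k tokens cheaply; target verifies them.

    Every accepted draft token saves one sequential target step. A real
    implementation checks all k proposals in a single batched target forward
    pass; this toy calls the target per position for readability.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap per step).
        proposal: List[Token] = []
        ctx = list(seq)
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)

        # 2. Target model checks each proposed token (greedy accept/reject).
        for proposed in proposal:
            if len(seq) - len(prompt) >= max_new_tokens:
                break
            expected = target(seq)
            if proposed == expected:
                seq.append(proposed)      # accepted: token comes "for free"
            else:
                seq.append(expected)      # rejected: keep target's token, stop
                break
    return seq


if __name__ == "__main__":
    # Toy predictors: target counts up by 1; draft usually agrees but
    # occasionally guesses +2, forcing a rejection and a fallback step.
    target_fn = lambda s: s[-1] + 1
    draft_fn = lambda s: s[-1] + 1 if len(s) % 5 else s[-1] + 2
    print(speculative_decode(target_fn, draft_fn, prompt=[0], max_new_tokens=10))
```

The gain comes entirely from how often draft and target agree, which is why coding and agentic output, with its highly predictable tokens, tends to benefit most.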
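On the autotune point: the sketch below is a generic Triton kernel, not code from ExLlamaV3, showing where autotune stalls come from. The first launch for a new `key` value benchmarks every listed config before caching the winner, so kernel selection rather than compute dominates that initial call.

```python
# Generic Triton autotune illustration; requires a CUDA GPU and the triton package.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
        triton.Config({"BLOCK": 512}, num_warps=8),
    ],
    key=["n_elements"],  # a new n_elements value triggers a fresh benchmark sweep
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, factor, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * factor, mask=mask)


def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    scale_kernel[grid](x, out, n, factor)
    return out


if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = scale(x, 2.0)   # first call: autotune sweeps every config (the stall)
    y = scale(x, 2.0)   # later calls reuse the cached best config
    torch.testing.assert_close(y, x * 2.0)
```

Running this in one process makes the cost visible: the first `scale` call pays for the sweep, and subsequent calls with the same problem size reuse the cached choice.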
Discovered: 2026-05-11
Published: 2026-05-11
Author: Unstable_Llama