ExLlamaV3 adds DFlash quantization, kernels
ExLlamaV3 v0.0.34 landed on May 9 with DFlash model quantization, lower autotune overhead, and new Triton attention kernels aimed at Gemma 4. The project keeps sharpening its core promise: more throughput from consumer GPUs without giving up flexibility.
This is the kind of release that compounds. No single feature is flashy, but the combination of quantization support, kernel work, and stall fixes is exactly how inference stacks win on real workloads.
- DFlash now moves from draft-model optimization into the quantization pipeline, which should make the speed path practical for a wider range of deployments (a generic sketch of the draft-and-verify idea follows this list)
- Reducing autotune stalls matters because local inference libraries often burn time in setup and kernel selection, not just raw compute (see the Triton autotune sketch after this list)
- Gemma 4-specific Triton kernels show ExLlamaV3 is still chasing architecture-level wins instead of relying on generic CUDA shortcuts
- The release cadence from May 2 to May 9 signals an aggressively active maintainer loop, which is a real advantage in fast-moving open-source infrastructure
- The strongest story remains coding and agentic workloads, where earlier DFlash benchmarks showed the biggest throughput gains
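For context on the draft-model angle: below is a minimal, framework-free sketch of greedy speculative decoding, the general technique a draft-model optimization builds on. Everything in it (the `speculative_decode` helper, the toy predictors, the greedy accept rule) is illustrative and assumed; it is not ExLlamaV3's API, and a real implementation verifies all proposed tokens in one batched target forward pass rather than one call per position.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int = 16,
    k: int = 4,
) -> List[Token]:
    """Draft proposes k tokens cheaply; target verifies them.

    Every accepted draft token saves one sequential target step. A real
    implementation checks all k proposals in a single batched target forward
    pass; this toy calls the target per position for readability.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap per step).
        proposal: List[Token] = []
        ctx = list(seq)
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)

        # 2. Target model checks each proposed token (greedy accept/reject).
        for proposed in proposal:
            if len(seq) - len(prompt) >= max_new_tokens:
                break
            expected = target(seq)
            if proposed == expected:
                seq.append(proposed)      # accepted: token comes "for free"
            else:
                seq.append(expected)      # rejected: keep target's token, stop
                break
    return seq


if __name__ == "__main__":
    # Toy predictors: target counts up by 1; draft usually agrees but
    # occasionally guesses +2, forcing a rejection and a fallback step.
    target_fn = lambda s: s[-1] + 1
    draft_fn = lambda s: s[-1] + 1 if len(s) % 5 else s[-1] + 2
    print(speculative_decode(target_fn, draft_fn, prompt=[0], max_new_tokens=10))
```

The gain comes entirely from how often draft and target agree, which is why coding and agentic output, with its highly predictable tokens, tends to benefit most.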
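On the autotune point: the sketch below is a generic Triton kernel, not code from ExLlamaV3, showing where autotune stalls come from. The first launch for a new `key` value benchmarks every listed config before caching the winner, so kernel selection rather than compute dominates that initial call.

```python
# Generic Triton autotune illustration; requires a CUDA GPU and the triton package.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
        triton.Config({"BLOCK": 512}, num_warps=8),
    ],
    key=["n_elements"],  # a new n_elements value triggers a fresh benchmark sweep
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, factor, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * factor, mask=mask)


def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    scale_kernel[grid](x, out, n, factor)
    return out


if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = scale(x, 2.0)   # first call: autotune sweeps every config (the stall)
    y = scale(x, 2.0)   # later calls reuse the cached best config
    torch.testing.assert_close(y, x * 2.0)
```

Running this in one process makes the cost visible: the first `scale` call pays for the sweep, and subsequent calls with the same problem size reuse the cached choice.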
Discovered: 2026-05-11
Published: 2026-05-11
Author: Unstable_Llama