oMLX DFlash update shows mixed Qwen3 results
Performance tests of DFlash block-diffusion speculative decoding in oMLX v0.3.5-rc1 show inconsistent results on M2 Max hardware. While Qwen3-Coder-30B-A3B achieved a 21% speedup, the smaller Qwen3.5-9B model saw a 44% slowdown due to draft model overhead.
DFlash's block-diffusion approach is a niche optimization that pays off only when the draft model is closely aligned with the target model. Code generation remains the primary use case where block-based predictions justify the overhead; with smaller models, the target is already cheap enough to run that drafting and verification cost more time than the accepted tokens save. Additionally, compatibility issues with DeltaNet-based architectures currently lead to system crashes.
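The trade-off between accepted draft tokens and draft overhead can be sketched with a simple cost model. This is a generic back-of-the-envelope model of speculative decoding, not oMLX's or DFlash's actual accounting; the function names and all input numbers below are hypothetical and are not measurements from these tests.

```python
def speculative_speedup(tokens_per_cycle: float,
                        draft_cost: float,
                        verify_cost: float = 1.0) -> float:
    """Throughput relative to plain decoding (1.0 = no change).

    tokens_per_cycle: mean tokens emitted per draft+verify cycle.
    draft_cost: cost of drafting one block, in units of one
                target-model forward pass (assumed unit).
    verify_cost: cost of the target's verification pass, same units.
    Plain decoding is taken to emit 1 token per 1 unit of cost.
    """
    if tokens_per_cycle <= 0 or draft_cost < 0 or verify_cost <= 0:
        raise ValueError("costs and token counts must be positive")
    return tokens_per_cycle / (draft_cost + verify_cost)


def break_even_tokens(draft_cost: float, verify_cost: float = 1.0) -> float:
    """Tokens per cycle needed for speculation to match plain decoding."""
    return draft_cost + verify_cost


# Illustrative inputs only (not taken from the benchmark):
# a large target with a relatively cheap draft and good acceptance...
large = speculative_speedup(tokens_per_cycle=3.0, draft_cost=0.5)
# ...versus a small target where the draft costs about as much as the
# target itself and few drafted tokens are accepted.
small = speculative_speedup(tokens_per_cycle=1.2, draft_cost=1.0)

print(f"large target: {large:.2f}x, small target: {small:.2f}x")
```

Under this model, speculation only helps when the tokens emitted per cycle exceed the combined draft and verification cost, which is consistent with a small target model ending up slower.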
DISCOVERED: 2026-04-15
PUBLISHED: 2026-04-15
AUTHOR: CrushingLoss