OPEN_SOURCE
REDDIT // 4h ago · OPEN-SOURCE RELEASE
llama.cpp adds DeepSeek v4 Flash support
This experimental fork of llama.cpp adds DeepSeek-V4-Flash support with a GGUF quantization strategy aimed at fitting the 284B MoE model on Macs with 128GB of RAM. The author says selective 2-bit quantization for routed experts, plus Q8 for shared weights, is already producing usable chat quality and around 21 tok/s on an M3 Max after Metal tuning.
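To make the 128GB target concrete, here is a rough back-of-the-envelope memory estimate. The split between routed-expert and shared parameters, and the effective bytes-per-weight for the 2-bit and Q8 GGUF formats, are illustrative assumptions, not figures from the post.

```python
# Rough memory-budget sketch for the described quantization scheme.
# The parameter split and bytes-per-weight figures are assumptions for
# illustration, not numbers from the fork or the DeepSeek V4 Flash release.

TOTAL_PARAMS_B = 284                   # total parameters (billions), from the post
ROUTED_PARAMS_B = 260                  # assumed: bulk of an MoE's weights sit in routed experts
SHARED_PARAMS_B = TOTAL_PARAMS_B - ROUTED_PARAMS_B  # assumed: attention + shared experts + embeddings

# Effective bytes per weight, including quantization block overhead (approximate):
BYTES_PER_WEIGHT_2BIT = 0.32           # ~2.6 bits/weight, in the range of GGUF 2-bit formats
BYTES_PER_WEIGHT_Q8   = 1.06           # ~8.5 bits/weight, in the range of Q8_0

routed_gb = ROUTED_PARAMS_B * BYTES_PER_WEIGHT_2BIT
shared_gb = SHARED_PARAMS_B * BYTES_PER_WEIGHT_Q8
total_gb = routed_gb + shared_gb

print(f"routed experts ~{routed_gb:.0f} GB, shared weights ~{shared_gb:.0f} GB, "
      f"total ~{total_gb:.0f} GB of a 128 GB unified-memory budget")
# The remaining headroom has to cover the KV cache, compute buffers, and the OS,
# which is why the scheme pushes the routed experts all the way down to 2-bit.
```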
// ANALYSIS
This is a strong proof-of-concept for local MoE inference, but it is still very much an experiment rather than a broadly validated release. The interesting part is not just that DeepSeek V4 Flash runs locally, but that the quantization scheme tries to preserve quality where it matters and squeeze size where it hurts least.
- The repo targets a very specific tradeoff: 2-bit routed experts, Q8 shared experts, and a GGUF build sized for Apple Silicon memory constraints (see the per-tensor selection sketch after this list).
- The reported 21 tok/s on an M3 Max is the real signal here: not frontier speed, but fast enough to make large local models feel practical.
- At 284B parameters, this still sits well above the normal local-LLM comfort zone, so 128GB unified memory remains a hard requirement for most users.
- The author’s quality comparison against Qwen 3.6 27B is promising, but it is anecdotal until proper benchmarks land.
- For the llama.cpp ecosystem, this is the kind of release that expands the boundary of what “runs locally” can mean, especially for MoE models.
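In practice, a selective scheme like the one in the first bullet comes down to picking a quantization type per tensor based on its name. The sketch below shows one way such a rule could look; the `*_exps` / `*_shexp` name patterns follow the convention llama.cpp uses for DeepSeek-style MoE GGUFs, but the fork's actual selection logic and tensor naming may differ.

```python
# Illustrative per-tensor quant selection, mirroring the described strategy:
# routed-expert tensors get a 2-bit type, everything else stays near-lossless.
# The name patterns and type labels are assumptions based on how llama.cpp
# names DeepSeek-style MoE tensors; the fork's real rules may differ.

def pick_quant_type(tensor_name: str) -> str:
    # Routed experts in DeepSeek-style GGUFs typically contain "_exps"
    # (e.g. "blk.12.ffn_down_exps.weight"); shared experts use "_shexp".
    if "_exps." in tensor_name or tensor_name.endswith("_exps"):
        return "Q2_K"   # aggressive 2-bit for the bulk of the parameters
    return "Q8_0"       # shared experts, attention, embeddings stay at Q8

examples = [
    "blk.12.ffn_down_exps.weight",   # routed expert  -> 2-bit
    "blk.12.ffn_down_shexp.weight",  # shared expert  -> Q8
    "blk.12.attn_q.weight",          # attention      -> Q8
]
for name in examples:
    print(f"{name:35s} -> {pick_quant_type(name)}")
```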
// TAGS
llama.cpp · deepseek-v4-flash · llm · inference · open-source · self-hosted
DISCOVERED
4h ago
2026-04-26
PUBLISHED
6h ago
2026-04-26
RELEVANCE
9/10
AUTHOR
antirez