OPEN_SOURCE
REDDIT · 25d ago · BENCHMARK RESULT
Krasis posts major Qwen MoE gains over llama.cpp
Krasis’s latest runtime update claims much faster local inference for large Qwen MoE models by moving both prefill and decode onto GPU with separate optimization paths, while reducing system RAM pressure versus its earlier approach. The project is open-source on GitHub, exposes an OpenAI-compatible server, and targets running oversized models on consumer cards like RTX 5080/5090.
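Because Krasis exposes an OpenAI-compatible server, any standard OpenAI-style client should be able to talk to it. A minimal standard-library sketch of building a `/chat/completions` request is below; the host, port, and model name are assumptions for illustration, not details confirmed by the post.

```python
import json
from urllib import request

# Assumed local endpoint; the actual host/port depend on how Krasis is launched.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str, model: str = "qwen3-coder") -> request.Request:
    """Build an OpenAI-compatible chat completion request (model name is hypothetical)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Summarize mixture-of-experts routing in one sentence.")
# Actually sending it requires a running Krasis server:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Since the wire format matches OpenAI's, existing tooling (including the planned Opencode/Aider integrations) only needs a base-URL override to target a local Krasis instance.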
// ANALYSIS
This is a serious shot at the local-inference status quo: if these benchmarks reproduce broadly, Krasis could become a go-to runtime for VRAM-constrained MoE workloads, but validation quality will decide whether it’s breakout or hype.
- The author reports up to 8.9x prefill and 4.7x decode improvements versus llama.cpp on Qwen3.5-122B-A10B under PCIe 4.0 constraints.
- Claimed 16GB 5080 performance on Qwen3-Coder-Next beating a 32GB 5090 llama.cpp setup reinforces Krasis’s “stream through GPU, avoid CPU bottlenecks” design.
- In-thread feedback challenges some llama.cpp baseline settings, so independent apples-to-apples runs are the most important next proof point.
- Current support centers on Qwen MoE variants, with Nemotron support called out as the next expansion target.
- OpenAI-compatible API plus planned Opencode/Aider integration makes this more than a benchmark demo; it’s aiming for developer workflow fit.
// TAGS
krasis · llm · inference · gpu · benchmark · open-source · self-hosted · llama-cpp
DISCOVERED
25d ago
2026-03-17
PUBLISHED
25d ago
2026-03-17
RELEVANCE
8/10
AUTHOR
mrstoatey