Valkyr adds optional TurboQuant inference path
Valkyr is a cross-vendor LLM inference runtime ported from TRiP-style math into Zig with Vulkan compute shaders, aimed at running on AMD, Intel, NVIDIA, Apple via MoltenVK, and Android without CUDA lock-in. The new announcement says it now includes an optional TurboQuant path for KV-cache compression, with TQ4 V-cache support, bit-exact packing validation, and reported 120 tok/s on an RTX 3090 for Gemma 2B in the baseline path. The repo frames this as an open-source, vendor-agnostic runtime that preserves parity against HuggingFace and ships a practical Vulkan-first implementation rather than a research demo.
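To make the "TQ4 V-cache with bit-exact packing validation" claim concrete, here is a minimal sketch of what 4-bit value-cache quantization with nibble packing and a bit-exact check can look like. The group size, scale scheme (per-group absmax), and function names are all assumptions for illustration; Valkyr's actual TurboQuant kernels run as Vulkan compute shaders and are not reproduced here.

```python
import numpy as np

def quantize_codes(v: np.ndarray, group: int = 32):
    """Quantize an fp32 vector to signed 4-bit codes with per-group absmax scales.
    Group size 32 is a hypothetical choice, not taken from the repo."""
    groups = v.reshape(-1, group)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # map to [-7, 7]
    scales = np.where(scales == 0, 1.0, scales)                # avoid div-by-zero
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes.ravel(), scales.ravel()

def pack_fast(codes: np.ndarray) -> np.ndarray:
    """Vectorized packer: two 4-bit two's-complement nibbles per byte."""
    n = (codes.reshape(-1, 2).astype(np.uint8)) & 0x0F
    return n[:, 0] | (n[:, 1] << 4)

def pack_ref(codes: np.ndarray) -> np.ndarray:
    """Naive scalar packer used as the CPU reference for bit-exact validation."""
    u = codes.astype(np.uint8)
    out = np.empty(codes.size // 2, dtype=np.uint8)
    for i in range(out.size):
        out[i] = (u[2 * i] & 0x0F) | ((u[2 * i + 1] & 0x0F) << 4)
    return out

rng = np.random.default_rng(0)
v = rng.standard_normal(4096).astype(np.float32)
codes, scales = quantize_codes(v)
# The validation step: the optimized packer must match the reference byte-for-byte.
assert np.array_equal(pack_fast(codes), pack_ref(codes))
```

The point of the byte-for-byte comparison is that packing (unlike quantization itself) is a lossless bit-manipulation step, so a GPU implementation can be held to exact equality against a trivially correct CPU/Python reference.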
Hot take: this is less a flashy model launch and more a serious systems-engineering release for local LLM inference, especially if you care about portability and cache compression over CUDA-specific performance tricks.
- The strongest differentiator is the Vulkan backend: one SPIR-V path across vendors, including MoltenVK on Apple.
- TurboQuant is positioned as optional and production-minded, with Algorithm 1 only, asymmetric K=fp / V=TQ4 by default, and correctness checks called out explicitly.
- The reported 36 MiB to 4.6 MiB V-cache reduction on Gemma 2B is the most concrete user-facing win.
- The repo claims four-tier parity against HuggingFace and bit-exact TQ4 packing versus CPU/Python references, which makes the benchmark claims more credible than a typical “fast inference” post.
- This will appeal most to local-LLM practitioners, GPU kernel hackers, and anyone trying to avoid CUDA lock-in.
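A rough sizing sketch shows why the 36 MiB to 4.6 MiB figure is plausible for 4-bit packing. The baseline precision, group size, and scale dtype below are assumptions (the post specifies none of them), so the estimate lands near, not exactly on, the reported number:

```python
MIB = 1 << 20

def v_cache_bytes(n_elems: int, bits: int, scale_bytes: int = 2, group: int = 32) -> int:
    """Rough packed V-cache size: quantized payload plus per-group scale overhead.
    group=32 and fp16 scales are illustrative assumptions."""
    payload = n_elems * bits // 8
    scales = (n_elems // group) * scale_bytes
    return payload + scales

# If the 36 MiB baseline is fp32 (an assumption), the cache holds this many elements:
n = 36 * MIB // 4
print(f"TQ4 estimate: {v_cache_bytes(n, bits=4) / MIB:.2f} MiB")   # ~5.06 MiB
```

An 8x payload reduction (32-bit to 4-bit) gives 4.5 MiB before scale overhead, which is in the same ballpark as the reported 4.6 MiB; the exact figure depends on the group size and scale precision TurboQuant actually uses.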
DISCOVERED: 2026-04-29 (3h ago)
PUBLISHED: 2026-04-29 (4h ago)
AUTHOR: inigid