Valkyr adds optional TurboQuant inference path
Valkyr is a cross-vendor LLM inference runtime ported from TRiP-style math into Zig with Vulkan compute shaders, aimed at running on AMD, Intel, NVIDIA, Apple via MoltenVK, and Android without CUDA lock-in. The new announcement says it now includes an optional TurboQuant path for KV-cache compression, with TQ4 V-cache support, bit-exact packing validation, and reported 120 tok/s on an RTX 3090 for Gemma 2B in the baseline path. The repo frames this as an open-source, vendor-agnostic runtime that preserves parity against HuggingFace and ships a practical Vulkan-first implementation rather than a research demo.
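To make the "TQ4 V-cache with bit-exact packing validation" claim concrete, here is a minimal sketch of what 4-bit value-cache quantization with nibble packing and a bit-exact check can look like. The group size, scale scheme (per-group absmax), and function names are all assumptions for illustration; Valkyr's actual TurboQuant kernels run as Vulkan compute shaders and are not reproduced here.

```python
import numpy as np

def quantize_codes(v: np.ndarray, group: int = 32):
    """Quantize an fp32 vector to signed 4-bit codes with per-group absmax scales.
    Group size 32 is a hypothetical choice, not taken from the repo."""
    groups = v.reshape(-1, group)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # map to [-7, 7]
    scales = np.where(scales == 0, 1.0, scales)                # avoid div-by-zero
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes.ravel(), scales.ravel()

def pack_fast(codes: np.ndarray) -> np.ndarray:
    """Vectorized packer: two 4-bit two's-complement nibbles per byte."""
    n = (codes.reshape(-1, 2).astype(np.uint8)) & 0x0F
    return n[:, 0] | (n[:, 1] << 4)

def pack_ref(codes: np.ndarray) -> np.ndarray:
    """Naive scalar packer used as the CPU reference for bit-exact validation."""
    u = codes.astype(np.uint8)
    out = np.empty(codes.size // 2, dtype=np.uint8)
    for i in range(out.size):
        out[i] = (u[2 * i] & 0x0F) | ((u[2 * i + 1] & 0x0F) << 4)
    return out

rng = np.random.default_rng(0)
v = rng.standard_normal(4096).astype(np.float32)
codes, scales = quantize_codes(v)
# The validation step: the optimized packer must match the reference byte-for-byte.
assert np.array_equal(pack_fast(codes), pack_ref(codes))
```

The point of the byte-for-byte comparison is that packing (unlike quantization itself) is a lossless bit-manipulation step, so a GPU implementation can be held to exact equality against a trivially correct CPU/Python reference.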
Hot take: this is less a flashy model launch and more a serious systems-engineering release for local LLM inference, especially if you care about portability and cache compression over CUDA-specific performance tricks.
- The strongest differentiator is the Vulkan backend: one SPIR-V path across vendors, including MoltenVK on Apple.
- TurboQuant is positioned as optional and production-minded, with Algorithm 1 only, asymmetric K=fp / V=TQ4 by default, and correctness checks called out explicitly.
- The reported 36 MiB to 4.6 MiB V-cache reduction on Gemma 2B is the most concrete user-facing win.
- The repo claims four-tier parity against HuggingFace and bit-exact TQ4 packing versus CPU/Python references, which makes the benchmark claims more credible than a typical “fast inference” post.
- This will appeal most to local-LLM practitioners, GPU kernel hackers, and anyone trying to avoid CUDA lock-in.
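A rough sizing sketch shows why the 36 MiB to 4.6 MiB figure is plausible for 4-bit packing. The baseline precision, group size, and scale dtype below are assumptions (the post specifies none of them), so the estimate lands near, not exactly on, the reported number:

```python
MIB = 1 << 20

def v_cache_bytes(n_elems: int, bits: int, scale_bytes: int = 2, group: int = 32) -> int:
    """Rough packed V-cache size: quantized payload plus per-group scale overhead.
    group=32 and fp16 scales are illustrative assumptions."""
    payload = n_elems * bits // 8
    scales = (n_elems // group) * scale_bytes
    return payload + scales

# If the 36 MiB baseline is fp32 (an assumption), the cache holds this many elements:
n = 36 * MIB // 4
print(f"TQ4 estimate: {v_cache_bytes(n, bits=4) / MIB:.2f} MiB")   # ~5.06 MiB
```

An 8x payload reduction (32-bit to 4-bit) gives 4.5 MiB before scale overhead, which is in the same ballpark as the reported 4.6 MiB; the exact figure depends on the group size and scale precision TurboQuant actually uses.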
DISCOVERED: 2026-04-29 (3h ago)
PUBLISHED: 2026-04-29 (4h ago)
AUTHOR: inigid