VibeVoice ExLlamaV3 fork speeds up local TTS
REDDIT · 11d ago · OPEN-SOURCE RELEASE

A developer released an ExLlamaV3 quantized fork of VibeVoice focused on faster local inference. The post claims q8 runs about 4x faster than fp16 with Transformers and points to both the GitHub repo and a Hugging Face model checkpoint for the fork.

// ANALYSIS

A strong practical win if you run speech models locally and want lower latency and higher throughput, especially on hardware where quantized inference matters more than absolute fidelity.

  • The main value here is speed: q8 quantization plus ExLlamaV3 appears to unlock a meaningful runtime improvement over the standard Transformers fp16 path.
  • This is most relevant for people already using VibeVoice locally and willing to trade some simplicity and possibly some quality for performance.
  • It reads like an experimental fork, not an official upstream release, so stability, compatibility, and maintenance are the main unknowns.
  • The post is light on benchmark methodology, so the “4x faster” claim should be treated as indicative rather than definitive.
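Since the post gives no benchmark methodology, the "4x" figure is worth reproducing yourself before relying on it. A minimal, framework-agnostic timing harness is sketched below; the two callables are stand-ins for real fp16 and q8 `generate()` calls, which the post does not show:

```python
import time

def measure_speedup(baseline, candidate, *, warmup=2, runs=10):
    """Time two inference callables and return candidate's speedup over baseline.

    Warm-up iterations are discarded so one-time costs (weight loading,
    kernel compilation, cache warm-up) don't distort the comparison.
    """
    def bench(fn):
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(runs):
            fn()
        return (time.perf_counter() - start) / runs

    t_baseline = bench(baseline)
    t_candidate = bench(candidate)
    return t_baseline / t_candidate  # >1.0 means candidate is faster

if __name__ == "__main__":
    # Hypothetical stand-ins for fp16 (slow) vs. q8 (fast) inference:
    fp16_call = lambda: sum(i * i for i in range(200_000))
    q8_call = lambda: sum(i * i for i in range(50_000))
    print(f"speedup: {measure_speedup(fp16_call, q8_call):.1f}x")
```

In a real comparison, keep the prompt, sampling settings, and batch size identical across both paths, and report the hardware alongside the number.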
// TAGS
text-to-speech, quantization, exllama, local-inference, voice-ai, opensource

DISCOVERED

2026-04-01

PUBLISHED

2026-04-01

RELEVANCE

6/10

AUTHOR

daLazyModder