TurboQuant restores Qwen3.6 speculative decoding
A PR to TheTom's llama-cpp-turboquant fork syncs upstream llama.cpp speculative checkpointing so hybrid MoE+SSM models like Qwen3.6-35B-A3B can actually use speculative decoding. The fix targets a silent fallback where recurrent state rollback failed, disabling the speedup path without an obvious error.
This is niche infrastructure work, but it matters because local inference performance often dies in exactly these edge-case compatibility gaps.
- Qwen3.6's hybrid recurrent layers need checkpoint-style state restore when draft tokens are rejected
- The PR pulls in upstream llama.cpp work, including save/restore support for recurrent speculative decoding
- Benchmarks in the PR show no meaningful regression on M2 Pro, with identical Wikitext-2 perplexity
- Draft/main tokenizer mismatch remains a practical speed limit, especially for Qwen3.5-to-Qwen3.6 pairings
- For regular llama.cpp users this is already handled; the value is specific to TurboQuant fork deployments
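To make the checkpoint-and-restore idea concrete, here is a minimal toy sketch in Python. It is not llama.cpp's or the fork's actual code: `ToyRecurrentModel`, `speculative_decode`, the `flaw` knob, and the hash-style state update are all hypothetical stand-ins chosen for illustration. The point it shows is that a recurrent state, unlike a KV cache, cannot be trimmed back token by token, so the verifier snapshots its state before evaluating a draft batch and restores that snapshot when a draft token is rejected.

```python
from dataclasses import dataclass


@dataclass
class ToyRecurrentModel:
    """Hypothetical stand-in for a model with recurrent (SSM-like) state."""
    seed: int
    flaw: int = 0   # illustrative knob: 0 = never wrong, 1 = always wrong
    state: int = 0

    def step(self, token: int) -> int:
        # Consuming a token overwrites the state; there is no per-token undo.
        self.state = (self.state * 31 + token + self.seed) % 1_000_003
        pred = self.state % 50
        if self.flaw and self.state % self.flaw == 0:
            pred = (pred + 1) % 50  # simulate a draft-model mistake
        return pred

    def save(self) -> int:
        return self.state

    def restore(self, snapshot: int) -> None:
        self.state = snapshot


def greedy_decode(model, prompt, n_new):
    """Reference output: plain one-token-at-a-time decoding (prompt non-empty)."""
    for t in prompt:
        pred = model.step(t)
    out = []
    for _ in range(n_new):
        out.append(pred)
        pred = model.step(pred)
    return out


def speculative_decode(main, draft, prompt, n_new, k=4):
    """Returns (tokens, rollbacks). Assumes a non-empty prompt."""
    for t in prompt:
        main_pred, draft_pred = main.step(t), draft.step(t)
    out, rollbacks = [], 0
    while len(out) < n_new:
        # Checkpoint both models before speculating.
        main_snap, draft_snap = main.save(), draft.save()
        proposal, d = [], draft_pred
        for _ in range(k):
            proposal.append(d)
            d = draft.step(d)
        # The main model evaluates the whole draft as one batch, so its
        # state now includes every draft token, accepted or not.
        preds = [main_pred] + [main.step(t) for t in proposal]
        n_ok = 0
        while n_ok < k and proposal[n_ok] == preds[n_ok]:
            n_ok += 1
        # Accepted prefix plus the main model's own token at that position.
        emitted = proposal[:n_ok] + [preds[n_ok]]
        if n_ok < k:
            # Rejection: roll back to the checkpoint, replay only kept tokens.
            main.restore(main_snap)
            rollbacks += 1
            replay = emitted
        else:
            replay = emitted[-1:]  # draft fully accepted; consume bonus token
        for t in replay:
            main_pred = main.step(t)
        draft.restore(draft_snap)
        for t in emitted:
            draft_pred = draft.step(t)
        out.extend(emitted)
    return out[:n_new], rollbacks
```

In this toy, the speculative output always matches plain greedy decoding regardless of draft quality; only the number of rollbacks changes. If the `main.restore` call were skipped after a rejection, the main model's state would still contain the rejected tokens and later predictions would diverge, which is the kind of failure that forces a fallback to non-speculative decoding.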
DISCOVERED
45d ago
2026-04-22
PUBLISHED
45d ago
2026-04-21
AUTHOR
dangerousdotnet