OPEN_SOURCE
REDDIT · 5h ago · INFRASTRUCTURE
TurboQuant restores Qwen3.6 speculative decoding
A PR to TheTom's llama-cpp-turboquant fork syncs upstream llama.cpp speculative-decoding checkpointing so hybrid MoE+SSM models like Qwen3.6-35B-A3B can actually use speculative decoding. The fix targets a silent fallback in which recurrent-state rollback failed, disabling the speedup path without any visible error.
// ANALYSIS
This is niche infrastructure work, but it matters because local inference performance often dies in exactly these edge-case compatibility gaps.
- Qwen3.6's hybrid recurrent layers need checkpoint-style state restore when draft tokens are rejected
- The PR pulls in upstream llama.cpp work, including save/restore support for recurrent speculative decoding
- Benchmarks in the PR show no meaningful regression on M2 Pro, with identical Wikitext-2 perplexity
- Draft/main tokenizer mismatch remains a practical speed limit, especially for Qwen3.5-to-Qwen3.6 pairings
- For regular llama.cpp users this is already handled; the value is specific to TurboQuant fork deployments
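The rollback mechanism described above can be sketched in miniature. This is a hypothetical illustration, not the actual llama.cpp or TurboQuant API: `HybridModel`, `checkpoint`, `restore`, and `speculative_step` are invented names showing why recurrent layers need a state snapshot before draft tokens run, and a restore-plus-replay when some are rejected.

```python
# Sketch of checkpoint-style recurrent-state rollback for speculative
# decoding. All names here are illustrative assumptions, not real APIs.
from copy import deepcopy


class HybridModel:
    """Toy stand-in for a model with SSM/recurrent layer state."""

    def __init__(self):
        self.hidden = 0.0  # recurrent state advanced by every token

    def checkpoint(self):
        # Snapshot the recurrent state BEFORE running draft tokens.
        return deepcopy(self.hidden)

    def restore(self, ckpt):
        # Roll back to the snapshot when draft tokens are rejected.
        self.hidden = ckpt

    def step(self, token):
        # Advance the recurrent state by one token (toy update rule).
        self.hidden += token


def speculative_step(main, draft_tokens, accept):
    """Run draft tokens; keep the accepted prefix, roll back the rest."""
    ckpt = main.checkpoint()
    accepted = []
    for tok in draft_tokens:
        main.step(tok)
        if not accept(tok):
            # Without a working restore, the stale recurrent state would
            # silently corrupt later decoding -- the failure mode the PR
            # turns from a silent fallback into a working path.
            main.restore(ckpt)
            for t in accepted:  # replay only the accepted prefix
                main.step(t)
            break
        accepted.append(tok)
    return accepted
```

For example, drafting tokens `[1, 2, 3]` with an acceptance rule of `t < 3` accepts the first two, rejects the third, and leaves the state exactly as if only `1` and `2` had ever been decoded. Attention-only transformers can skip the replay by truncating the KV cache; recurrent layers cannot, which is why checkpointing is the natural fix.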
// TAGS
llama-cpp-turboquant · qwen3.6 · inference · open-source · llm · benchmark
DISCOVERED
2026-04-22
PUBLISHED
2026-04-21
RELEVANCE
8/10
AUTHOR
dangerousdotnet