OPEN_SOURCE
REDDIT // 10d ago · BENCHMARK RESULT
llama.cpp-tq3 shrinks Qwen3.5-27B, fits 16GB GPUs
TurboQuant-inspired ideas have been pushed into weights via a llama.cpp fork and a new TQ3_1S GGUF quantization for Qwen3.5-27B. On the author’s bench, it lands at 12.9 GB with only a 0.0139 PPL gap to Q4_0, enough to fit the 27B model fully on a 16GB RTX 5060 Ti.
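The reported size implies an effective bit rate well below Q4_0's. A quick back-of-the-envelope check (assuming decimal gigabytes and a nominal 27e9 parameters; the model's exact parameter count will differ slightly):

```python
# Effective bits per weight implied by the reported 12.9 GB file size.
# Assumptions: decimal GB (1e9 bytes) and a nominal 27e9 parameters.
params = 27e9
tq3_gb = 12.9
bits_per_weight = tq3_gb * 1e9 * 8 / params
print(f"{bits_per_weight:.2f} bits/weight")
```

That works out to roughly 3.8 bits per weight, versus Q4_0's 4.5 bits per weight (4-bit values plus a per-block scale), which is where the memory saving comes from.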
// ANALYSIS
This is a fit-and-efficiency win, not a universal replacement for Q4_0. The meaningful story is that 27B-class local inference just became more practical on consumer GPUs without giving up much quality.
- The key delta is memory, not raw perplexity: about 1.5 GB saved on a 27B model can decide whether it stays entirely on GPU.
- The approach is genuinely algorithmic, combining Walsh-Hadamard rotation, centroid quantization, and dual half-block scales rather than just repackaging existing bit layouts.
- The release depends on a custom llama.cpp fork, so adoption hinges on maintaining that runtime path or upstreaming the support.
- The author's caveats are important: this is one strong data point on one model and one card, not proof that TQ3_1S generalizes cleanly to every model size.
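To make the second bullet concrete, here is a toy sketch of that pipeline on a single 32-weight block: rotate with a Walsh-Hadamard transform to spread outliers, split into two half-blocks with independent scales, and snap to a small centroid set. This is an illustration under stated assumptions, not the actual TQ3_1S code; in particular, the uniform centroids here stand in for whatever learned codebook the fork uses.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_block(block, n_centroids=8):
    """Toy TQ3_1S-style round trip on one block (illustrative only)."""
    H = hadamard(block.size)
    rotated = H @ block                      # spread outlier weights across the block
    halves = rotated.reshape(2, -1)          # two half-blocks
    scales = np.abs(halves).max(axis=1, keepdims=True) + 1e-12  # dual half-block scales
    normed = halves / scales                 # each half now in [-1, 1]
    centroids = np.linspace(-1, 1, n_centroids)  # uniform stand-in for a learned codebook
    idx = np.abs(normed[..., None] - centroids).argmin(axis=-1)
    deq = centroids[idx] * scales            # dequantize
    return H.T @ deq.reshape(-1)             # inverse rotation (H is orthogonal)

rng = np.random.default_rng(0)
w = rng.normal(size=32)
w_hat = quantize_block(w)
rmse = np.sqrt(np.mean((w - w_hat) ** 2))
```

The rotation is what lets a coarse centroid grid survive outliers: after the transform, no single weight dominates its half-block's scale.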
// TAGS
llama.cpp-tq3 · open-source · benchmark · gpu · inference · llm
DISCOVERED
2026-04-01 (10d ago)
PUBLISHED
2026-04-01 (10d ago)
RELEVANCE
9/10
AUTHOR
pmttyji