llama.cpp-tq3 shrinks Qwen3.5-27B, fits 16GB GPUs
OPEN_SOURCE ↗
REDDIT · 10d ago · BENCHMARK RESULT


A llama.cpp fork applies TurboQuant-inspired ideas to weight quantization, introducing a new TQ3_1S GGUF format for Qwen3.5-27B. On the author’s bench, it lands at 12.9 GB with only a 0.0139 perplexity gap to Q4_0, small enough to fit the 27B model fully on a 16GB RTX 5060 Ti.
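As a back-of-envelope sanity check on the reported numbers (assuming decimal gigabytes and a 27B parameter count), the file size implies an effective bit width just under 4 bits per weight:

```python
# Effective bits per weight implied by the reported TQ3_1S file size.
size_gb = 12.9           # reported file size (assumed decimal GB)
params_billions = 27     # Qwen3.5-27B parameter count
bits_per_weight = size_gb * 8 / params_billions
print(f"{bits_per_weight:.2f} bits/weight")  # ≈ 3.82
```

For comparison, Q4_0 stores 4-bit weights plus a per-block scale, so its effective width is somewhat above 4 bits; the gap between the two is where the roughly 1.5 GB of savings comes from.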

// ANALYSIS

This is a fit-and-efficiency win, not a universal replacement for Q4_0. The meaningful story is that 27B-class local inference just became more practical on consumer GPUs without giving up much quality.

  • The key delta is memory, not raw perplexity: about 1.5 GB saved on a 27B model can decide whether it stays entirely on GPU.
  • The approach is genuinely algorithmic, combining Walsh-Hadamard rotation, centroid quantization, and dual half-block scales instead of just repackaging existing bits.
  • The release depends on a custom llama.cpp fork, so adoption hinges on maintaining that runtime path or upstreaming the support.
  • The author’s caveats are important: this is one strong witness on one model and one card, not proof that TQ3_1S generalizes cleanly to every model size.
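The three ingredients named above can be illustrated with a minimal numerical sketch. This is a hypothetical toy layout, not the fork's actual TQ3_1S kernel: it rotates a weight block with an orthonormal Walsh-Hadamard transform to spread outliers, then quantizes each half of the block to the nearest entry of a small centroid table, with an independent scale per half.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_block(w: np.ndarray, centroids: np.ndarray):
    """Rotate a block, then quantize each half with its own scale (dual half-block scales)."""
    n = w.size
    r = hadamard(n) @ w                    # rotation spreads large-magnitude outliers
    half = n // 2
    idx, scales = [], []
    for part in (r[:half], r[half:]):
        # scale so the half fits the centroid range; clamp to avoid divide-by-zero
        s = max(np.max(np.abs(part)) / np.max(np.abs(centroids)), 1e-12)
        scales.append(s)
        # nearest-centroid assignment after scaling
        idx.append(np.argmin(np.abs(part[:, None] / s - centroids[None, :]), axis=1))
    return np.concatenate(idx), np.array(scales)

def dequantize_block(idx, scales, centroids, n):
    """Reconstruct: look up centroids, rescale each half, undo the rotation."""
    half = n // 2
    r = np.concatenate([centroids[idx[:half]] * scales[0],
                        centroids[idx[half:]] * scales[1]])
    return hadamard(n).T @ r               # orthonormal => inverse is transpose
```

With an 8-entry centroid table (3-bit indices) the storage cost is 3 bits per weight plus two scales per block, which is the kind of budget a sub-4-bit format like this implies. The centroid values themselves would come from clustering real weight distributions; `np.linspace(-1, 1, 8)` stands in here.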
// TAGS
llama.cpp-tq3 · open-source · benchmark · gpu · inference · llm

DISCOVERED

2026-04-01 (10d ago)

PUBLISHED

2026-04-01 (10d ago)

RELEVANCE

9/10

AUTHOR

pmttyji