YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

llama.cpp-tq3 shrinks Qwen3.5-27B, fits 16GB GPUs

// 56d ago · BENCHMARK RESULT

A llama.cpp fork applies TurboQuant-inspired ideas to model weights through a new TQ3_1S GGUF quantization for Qwen3.5-27B. On the author’s benchmark, the quantized model comes to 12.9 GB with a perplexity (PPL) gap of only 0.0139 relative to Q4_0, small enough to fit the 27B model entirely on a 16GB RTX 5060 Ti.
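
As a rough back-of-the-envelope check on those figures, the sketch below converts the reported file size into an effective bits-per-weight number and a VRAM headroom estimate. It assumes 1 GB = 10^9 bytes, a nominal 27 billion parameters taken from the model name, and that the whole file is weight data; none of these are confirmed by the source, and the helper names are illustrative.

```python
# Back-of-the-envelope arithmetic only; every input besides 12.9 GB and 16 GB
# is an assumption (decimal gigabytes, exactly 27e9 parameters).

def effective_bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average number of stored bits per parameter for a model file."""
    return file_bytes * 8 / n_params

def vram_headroom_gb(vram_gb: float, model_gb: float) -> float:
    """VRAM left over for KV cache, activations, and runtime buffers."""
    return vram_gb - model_gb

bpw = effective_bits_per_weight(12.9e9, 27e9)   # roughly 3.8 bits per weight
headroom = vram_headroom_gb(16.0, 12.9)         # roughly 3.1 GB

print(f"effective bits per weight: {bpw:.2f}")
print(f"headroom on a 16 GB card:  {headroom:.1f} GB")
```

Under these assumptions the headroom has to cover the KV cache and runtime buffers, so the usable context length still depends on settings the source does not report.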

// ANALYSIS

This is a fit-and-efficiency win, not a universal replacement for Q4_0. The meaningful story is that 27B-class local inference just became more practical on consumer GPUs without giving up much quality.

  • The key delta is memory, not raw perplexity: about 1.5 GB saved on a 27B model can decide whether it stays entirely on GPU.
  • The approach is genuinely algorithmic, combining Walsh-Hadamard rotation, centroid quantization, and dual half-block scales instead of just repackaging existing bits.
  • The release depends on a custom llama.cpp fork, so adoption hinges on maintaining that runtime path or upstreaming the support.
  • The author’s caveats are important: this is one strong data point on one model and one card, not proof that TQ3_1S generalizes cleanly to every model size.
// TAGS
llama.cpp-tq3, open-source, benchmark, gpu, inference, llm

DISCOVERED

56d ago

2026-04-01

PUBLISHED

56d ago

2026-04-01

RELEVANCE

9/10

AUTHOR

pmttyji