llama.cpp-tq3 shrinks Qwen3.5-27B, fits 16GB GPUs

// BENCHMARK RESULT (102d ago)

A llama.cpp fork applies TurboQuant-inspired ideas to model weights through a new TQ3_1S GGUF quantization for Qwen3.5-27B. On the author’s benchmark, the quantized model comes to 12.9 GB with a perplexity (PPL) gap of only 0.0139 versus Q4_0, which is small enough to fit the 27B model fully on a 16GB RTX 5060 Ti.
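A rough back-of-envelope check of the fit claim is sketched below. Only the 12.9 GB size and the 16GB card come from the post; the parameter count of exactly 27e9 and the use of decimal gigabytes are assumptions, and the helper names are hypothetical. Real headroom is smaller once the KV cache, activations, and runtime buffers are counted.

```python
# Back-of-envelope fit check. Everything except MODEL_GB and VRAM_GB is an assumption.
PARAMS = 27e9      # assumed: nominal parameter count implied by "27B"
MODEL_GB = 12.9    # from the post: size of the TQ3_1S GGUF
VRAM_GB = 16.0     # from the post: nominal capacity of the RTX 5060 Ti


def effective_bits_per_weight(size_gb: float, params: float) -> float:
    """Average bits stored per parameter, assuming decimal GB (1e9 bytes)."""
    return size_gb * 1e9 * 8 / params


def headroom_gb(vram_gb: float, model_gb: float) -> float:
    """VRAM left after the weights, before KV cache and runtime buffers."""
    return vram_gb - model_gb


bpw = effective_bits_per_weight(MODEL_GB, PARAMS)
spare = headroom_gb(VRAM_GB, MODEL_GB)
print(f"effective size: {bpw:.2f} bits per weight")
print(f"nominal headroom: {spare:.1f} GB")
```

Under these assumptions the file averages a little under 4 bits per weight and leaves roughly 3 GB of nominal headroom for context and buffers.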

// ANALYSIS

This is a fit-and-efficiency win, not a universal replacement for Q4_0. The meaningful story is that 27B-class local inference just became more practical on consumer GPUs without giving up much quality.

– The key delta is memory, not raw perplexity: about 1.5 GB saved on a 27B model can decide whether it stays entirely on GPU.
– The approach is genuinely algorithmic, combining Walsh-Hadamard rotation, centroid quantization, and dual half-block scales instead of just repackaging existing bits.
– The release depends on a custom llama.cpp fork, so adoption hinges on maintaining that runtime path or upstreaming the support.
– The author’s caveats are important: this is one strong data point on one model and one card, not proof that TQ3_1S generalizes cleanly to every model size.
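To make the three ingredients named above concrete, here is a minimal sketch of how a Walsh-Hadamard rotation, a centroid codebook, and two half-block scales could fit together. It is not the fork's actual TQ3_1S layout: the block size of 32, the 3-bit codebook (approximately the 8-level Lloyd-Max points for a unit Gaussian), the RMS-based scales, and all function names are assumptions chosen for illustration.

```python
import math
import random

BLOCK = 32  # assumed block size; must be a power of two for the transform
# Assumed 3-bit codebook: roughly the 8-level Lloyd-Max points for a unit Gaussian.
CENTROIDS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]


def wht(values):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice undoes it."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


def quantize_block(weights):
    """Rotate a block, then code each half against its own scale."""
    rotated = wht(weights)
    half = len(rotated) // 2
    scales, codes = [], []
    for part in (rotated[:half], rotated[half:]):
        rms = math.sqrt(sum(v * v for v in part) / len(part))
        scale = rms if rms > 0 else 1.0  # avoid dividing by zero on an all-zero half
        scales.append(scale)
        for v in part:
            # Index of the nearest centroid to the normalized value.
            codes.append(min(range(len(CENTROIDS)),
                             key=lambda k: abs(v / scale - CENTROIDS[k])))
    return scales, codes


def dequantize_block(scales, codes):
    """Look up centroids, reapply the half-block scales, and rotate back."""
    half = len(codes) // 2
    rotated = [CENTROIDS[c] * scales[0 if i < half else 1]
               for i, c in enumerate(codes)]
    return wht(rotated)


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]
scales, codes = quantize_block(block)
restored = dequantize_block(scales, codes)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored))
                    / sum(a * a for a in block))
print(f"relative reconstruction error: {rel_err:.3f}")
```

The rotation spreads outlier weights across the block so that a fixed codebook fits the values better, and the per-half scales let the two halves of a block differ in magnitude at the cost of one extra scale.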

// TAGS

llama.cpp-tq3, open-source, benchmark, gpu, inference, llm

DISCOVERED

102d ago

2026-04-01

PUBLISHED

102d ago

2026-04-01

RELEVANCE

9/10

AUTHOR

pmttyji
