Codebook packing cuts LLM RAM 25%, stays lossless
OPEN_SOURCE
REDDIT · 29d ago · OPEN SOURCE RELEASE


A solo developer built Adaptive Codebook Compression (ACC), a lossless LLM weight compression scheme that exploits the empirical observation that BF16 model weights use far fewer unique values than the theoretical 65,536 the format allows — typically ~7,000–13,000 per layer. By replacing raw weights with packed codebook indices, the tool achieves 10–25% VRAM savings with exact output fidelity, at the cost of roughly 2–3x slower inference.
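The empirical observation behind ACC, that a layer holds far fewer distinct BF16 bit patterns than the 65,536 the format allows, is easy to check yourself. A minimal sketch, assuming numpy and using a random Gaussian matrix as a stand-in for a real weight layer (not the tool's own code):

```python
import numpy as np

def bf16_bits(x: np.ndarray) -> np.ndarray:
    # BF16 is the top 16 bits of an IEEE float32, so truncating
    # float32 bit patterns reproduces BF16 storage exactly.
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

# Hypothetical layer: LLM weights are roughly Gaussian with small std dev.
rng = np.random.default_rng(0)
layer = rng.normal(0.0, 0.02, size=(2048, 2048))

codes = bf16_bits(layer).ravel()
n_unique = np.unique(codes).size
bits = int(np.ceil(np.log2(n_unique)))
print(f"{n_unique} unique BF16 patterns -> {bits}-bit indices (vs 16 raw)")
```

The index-only saving is (16 − bits)/16 of the raw weight bytes, before codebook overhead, which is how ~13-bit layers land in the reported 10–25% range.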

// ANALYSIS

This is the rare quantization project with a genuinely novel angle: lossless by default, with benchmarks to prove it — cosine similarity >0.999 and exact greedy token match on tested models.

  • The core trick is that BF16 model weights are surprisingly non-diverse: layers in Qwen3-1.7B use only ~13 bits worth of unique values, so packing indices with no wasted bits via LCM-group bitpacking yields real savings
  • VRAM reduction is modest (~18% lossless on tested models) compared to 4-bit GGUF, but the target audience is different: users who cannot tolerate any quality degradation
  • The CPU-offload path is compelling — models that don't fit in VRAM can run entirely from system RAM via a C/OpenMP kernel, at ~0.5 tok/s
  • Speed penalty (~2.3x on GPU) is steep and limits production viability today; llama.cpp's quantization-aware kernels are far more optimized
  • Still a proof-of-concept with slow offline compression (~60 min for 1.7B on CPU), but the intellectual foundation is sound and the lossless claim is verifiable
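The compression step described in the first bullet can be sketched end to end: build a per-layer codebook of unique bit patterns, store bit-packed indices, and verify the round trip is bit-exact. This is an illustrative reimplementation under assumptions, not ACC's actual code; pack/unpack here use Python big-integer shifts rather than an LCM-grouped C kernel, but the zero-waste index layout is the same idea:

```python
import math
import numpy as np

def pack(indices, bits: int) -> bytes:
    # Concatenate `bits`-wide indices back to back; no per-index padding.
    big = 0
    for i, idx in enumerate(indices):
        big |= int(idx) << (i * bits)
    return big.to_bytes(math.ceil(len(indices) * bits / 8), "little")

def unpack(blob: bytes, bits: int, count: int) -> np.ndarray:
    big = int.from_bytes(blob, "little")
    mask = (1 << bits) - 1
    return np.array([(big >> (i * bits)) & mask for i in range(count)],
                    dtype=np.uint32)

# Hypothetical layer: 4096 BF16 bit patterns drawn from a small value pool.
rng = np.random.default_rng(1)
pool = rng.choice(2**16, size=7000, replace=False).astype(np.uint16)
weights = rng.choice(pool, size=4096)

codebook, indices = np.unique(weights, return_inverse=True)
bits = max(1, math.ceil(math.log2(codebook.size)))
blob = pack(indices, bits)

restored = codebook[unpack(blob, bits, indices.size)]
assert np.array_equal(restored, weights)  # bit-exact, i.e. lossless
print(f"{codebook.size} codes, {bits}-bit indices, "
      f"{len(blob) / (weights.size * 2):.0%} of raw BF16 size")
```

On a real multi-million-parameter layer the codebook itself (at most 65,536 two-byte entries) amortizes to negligible overhead; in this toy example it does not, which is why ACC operates on whole layers.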
// TAGS
adaptive-codebook-compression · llm · inference · edge-ai · open-source · gpu

DISCOVERED

2026-03-14 (29d ago)

PUBLISHED

2026-03-14 (29d ago)

RELEVANCE

7/10

AUTHOR

bigattichouse