OPEN_SOURCE
REDDIT // 27d ago // OPEN-SOURCE RELEASE
Qwen3.5 122B INT4 Heretic quantization fits in 63GB
A community researcher released an INT4 quantization of Qwen3.5-122B-A10B using Intel AutoRound, combined with directional ablation to strip most safety refusals — shrinking the model from 234GB to 63GB while preserving near-identical quality. Tested on dual ASUS Ascent (DGX Spark-class) hardware at 24–27 tok/s with 225K context.
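Directional ablation (the technique the Heretic tool applies) works by estimating a "refusal direction" in activation space and projecting it out of selected weight matrices, so the model can no longer express that direction in its outputs. A minimal NumPy sketch of the projection step, with toy dimensions and a hypothetical `ablate_direction` helper (not the release's actual code):

```python
import numpy as np

def ablate_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs along direction d.

    W maps hidden states to outputs via W @ x; replacing W with
    (I - d d^T) W guarantees every output is orthogonal to d.
    """
    d = d / np.linalg.norm(d)          # unit refusal direction
    return W - np.outer(d, d) @ W      # project d out of the column space

# Toy demonstration: after ablation, outputs have no component along d.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))        # stand-in for e.g. attn.o_proj
d = rng.standard_normal(8)             # stand-in for the refusal direction
W_abl = ablate_direction(W, d)
d_unit = d / np.linalg.norm(d)
print(np.allclose(d_unit @ W_abl, np.zeros(8)))
```

In practice the direction is estimated from the difference in mean activations between prompts the model refuses and prompts it answers, and the projection is applied to the weight matrices named in the release (`attn.o_proj`, `mlp.down_proj`).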
// ANALYSIS
Combining aggressive quantization with alignment ablation in one release is rare and technically interesting — most community quants pick one or the other.
- Intel AutoRound's sign-gradient descent INT4 achieves a 73% size reduction (234GB → 63GB) with a KL divergence of only 0.0916 vs. the base model
- Directional ablation via Heretic v1.2.0 cuts the refusal rate from 99/100 to 9/100 test prompts by targeting the `attn.o_proj` and `mlp.down_proj` layers specifically
- Critical precision layers (vision encoder, shared expert projections, MoE routing gates, lm_head) are kept at FP16/BF16 — a thoughtful decision, since shared experts fire on every token
- vLLM-compatible via the GPTQ Marlin format; includes DGX Spark-specific deployment notes (NCCL deadlock workarounds, unified memory utilization caps)
- Low community traction so far (score 6, 2 comments), but technically solid enough to be useful for local LLM practitioners with high-VRAM setups
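The mixed-precision scheme above amounts to a name-based routing rule: layers whose names match the sensitive categories stay at FP16/BF16, everything else is quantized to INT4. A small sketch of that selection logic, with hypothetical name patterns chosen to mirror the release's description (real Qwen3.5 module names may differ):

```python
# Assumed substrings identifying layers the release keeps at full precision:
# vision encoder, shared expert projections, MoE routing gates, and lm_head.
KEEP_FULL_PRECISION = ("vision", "shared_expert", "gate", "lm_head")

def precision_for(layer_name: str) -> str:
    """Return the target precision for a layer, by name pattern."""
    if any(pattern in layer_name for pattern in KEEP_FULL_PRECISION):
        return "fp16"
    return "int4"

# Expert MLPs and attention projections get quantized; routers do not.
for name in ("model.layers.3.mlp.down_proj",
             "model.layers.3.mlp.gate",
             "model.layers.3.mlp.shared_expert.down_proj",
             "lm_head"):
    print(name, "->", precision_for(name))
```

Keeping the routing gates and shared experts at full precision is the cheap part of the trade: they are small relative to the expert weights but touched on every token, so quantization error there compounds across the whole sequence.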
// TAGS
llm · open-weights · inference · open-source · self-hosted
DISCOVERED
2026-03-16 (27d ago)
PUBLISHED
2026-03-16 (27d ago)
RELEVANCE
6 / 10
AUTHOR
Ok-Treat-3016