Qwen3.5 122B INT4 heretic quantization fits 63GB
OPEN_SOURCE ↗
REDDIT // 27d ago · OPEN SOURCE RELEASE

A community researcher released an INT4 quantization of Qwen3.5-122B-A10B using Intel AutoRound, combined with directional ablation to strip most safety refusals — shrinking the model from 234GB to 63GB while preserving near-identical quality. Tested on dual ASUS Ascent (DGX Spark-class) hardware at 24–27 tok/s with 225K context.
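To make the size math concrete, here is a toy INT4 symmetric group quantizer in numpy. It uses plain round-to-nearest as a simplified stand-in for AutoRound's tuned rounding (the real library learns rounding offsets; this sketch does not), with illustrative shapes and a hypothetical group size of 128:

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    """Quantize a flat weight vector to INT4 per group; return ints and per-group scales."""
    w = w.reshape(-1, group_size)
    # Symmetric INT4 covers -8..7; map the group's max magnitude to 7.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP weights from INT4 values and scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight row
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step per group
```

Going from 16-bit to 4-bit weights is a 4x cut before the small per-group scale overhead and the layers kept at full precision, which lines up with the reported 234GB → 63GB (~73%) reduction.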

// ANALYSIS

Combining aggressive quantization with alignment ablation in one release is rare and technically interesting — most community quants pick one or the other.

  • Intel AutoRound's signed-gradient-descent INT4 rounding achieves a 73% size reduction (234GB → 63GB) with a KL divergence of only 0.0916 vs. the base model
  • Directional ablation via Heretic v1.2.0 cuts refusal rate from 99/100 to 9/100 test prompts by targeting `attn.o_proj` and `mlp.down_proj` layers specifically
  • Critical precision layers (vision encoder, shared expert projections, MoE routing gates, lm_head) are kept at FP16/BF16 — a thoughtful decision since shared experts fire on every token
  • vLLM-compatible with GPTQ Marlin format; includes DGX Spark-specific deployment notes (NCCL deadlock workarounds, unified memory utilization caps)
  • Low community traction so far (score 6, 2 comments) but technically solid enough to be useful for local LLM practitioners with high-VRAM setups
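The directional-ablation step above can be sketched in a few lines of numpy: estimate a "refusal direction" as the difference of mean activations on two prompt sets, then project it out of a weight matrix that writes to the residual stream (the role `attn.o_proj` and `mlp.down_proj` play). All tensors here are synthetic; the real Heretic tool operates on actual hidden states and model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Synthetic hidden states: "harmful" prompts carry an extra component
# along a planted refusal direction.
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)
harmless = rng.normal(size=(100, d_model))
harmful = rng.normal(size=(100, d_model)) + 3.0 * refusal_dir

# Difference-of-means estimate of the refusal direction.
d = harmful.mean(axis=0) - harmless.mean(axis=0)
d /= np.linalg.norm(d)

# Ablation: W' = (I - d d^T) W, so the projection can no longer
# write any component along d into the residual stream.
W = rng.normal(size=(d_model, d_model))  # stand-in for o_proj/down_proj
W_ablated = W - np.outer(d, d) @ W

x = rng.normal(size=d_model)
leak = abs(d @ (W_ablated @ x))  # component along d after ablation
```

Because `d` is unit-norm, `d @ W_ablated` is exactly zero up to float error, which is why ablating just the output-side projections can suppress refusals while leaving the rest of the computation intact.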
// TAGS
llm · open-weights · inference · open-source · self-hosted

DISCOVERED

2026-03-16 (27d ago)

PUBLISHED

2026-03-16 (27d ago)

RELEVANCE

6 / 10

AUTHOR

Ok-Treat-3016