OPEN_SOURCE
REDDIT // 27d ago // OPEN-SOURCE RELEASE
Qwen3.5 122B INT4 Heretic quantization fits in 63GB
A community researcher released an INT4 quantization of Qwen3.5-122B-A10B using Intel AutoRound, combined with directional ablation to strip most safety refusals — shrinking the model from 234GB to 63GB while preserving near-identical quality. Tested on dual ASUS Ascent (DGX Spark-class) hardware at 24–27 tok/s with 225K context.
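Directional ablation (the technique the Heretic tool applies) works by estimating a "refusal direction" in activation space and projecting it out of selected weight matrices, so the model can no longer express that direction in its outputs. A minimal NumPy sketch of the projection step, with toy dimensions and a hypothetical `ablate_direction` helper (not the release's actual code):

```python
import numpy as np

def ablate_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs along direction d.

    W maps hidden states to outputs via W @ x; replacing W with
    (I - d d^T) W guarantees every output is orthogonal to d.
    """
    d = d / np.linalg.norm(d)          # unit refusal direction
    return W - np.outer(d, d) @ W      # project d out of the column space

# Toy demonstration: after ablation, outputs have no component along d.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))        # stand-in for e.g. attn.o_proj
d = rng.standard_normal(8)             # stand-in for the refusal direction
W_abl = ablate_direction(W, d)
d_unit = d / np.linalg.norm(d)
print(np.allclose(d_unit @ W_abl, np.zeros(8)))
```

In practice the direction is estimated from the difference in mean activations between prompts the model refuses and prompts it answers, and the projection is applied to the weight matrices named in the release (`attn.o_proj`, `mlp.down_proj`).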
// ANALYSIS
Combining aggressive quantization with alignment ablation in one release is rare and technically interesting — most community quants pick one or the other.
- Intel AutoRound's sign-gradient descent INT4 achieves a 73% size reduction (234GB → 63GB) with a KL divergence of only 0.0916 vs. the base model
- Directional ablation via Heretic v1.2.0 cuts the refusal rate from 99/100 to 9/100 test prompts by targeting the `attn.o_proj` and `mlp.down_proj` layers specifically
- Critical precision layers (vision encoder, shared expert projections, MoE routing gates, lm_head) are kept at FP16/BF16 — a thoughtful decision, since shared experts fire on every token
- vLLM-compatible via the GPTQ Marlin format; includes DGX Spark-specific deployment notes (NCCL deadlock workarounds, unified memory utilization caps)
- Low community traction so far (score 6, 2 comments), but technically solid enough to be useful for local LLM practitioners with high-VRAM setups
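The mixed-precision scheme above amounts to a name-based routing rule: layers whose names match the sensitive categories stay at FP16/BF16, everything else is quantized to INT4. A small sketch of that selection logic, with hypothetical name patterns chosen to mirror the release's description (real Qwen3.5 module names may differ):

```python
# Assumed substrings identifying layers the release keeps at full precision:
# vision encoder, shared expert projections, MoE routing gates, and lm_head.
KEEP_FULL_PRECISION = ("vision", "shared_expert", "gate", "lm_head")

def precision_for(layer_name: str) -> str:
    """Return the target precision for a layer, by name pattern."""
    if any(pattern in layer_name for pattern in KEEP_FULL_PRECISION):
        return "fp16"
    return "int4"

# Expert MLPs and attention projections get quantized; routers do not.
for name in ("model.layers.3.mlp.down_proj",
             "model.layers.3.mlp.gate",
             "model.layers.3.mlp.shared_expert.down_proj",
             "lm_head"):
    print(name, "->", precision_for(name))
```

Keeping the routing gates and shared experts at full precision is the cheap part of the trade: they are small relative to the expert weights but touched on every token, so quantization error there compounds across the whole sequence.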
// TAGS
llm · open-weights · inference · open-source · self-hosted
DISCOVERED
2026-03-16 (27d ago)
PUBLISHED
2026-03-16 (27d ago)
RELEVANCE
6 / 10
AUTHOR
Ok-Treat-3016