Heretic automates LLM safety-alignment removal

// 86d ago · OPENSOURCE RELEASE

Heretic is a Python tool that removes safety alignment ("censorship") from transformer-based LLMs in a single command, no ML expertise required. It implements a parameterized variant of directional ablation with automatic hyperparameter optimization, producing uncensored models with significantly less capability degradation than prior art.
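
To make "directional ablation" concrete, here is a minimal sketch in plain Python (lists only, no dependencies). It assumes the refusal direction is taken as the normalized difference between mean activations on harmful and harmless prompts, following the general idea in Arditi et al. The function names and the `strength` parameter are hypothetical stand-ins for illustration, not Heretic's actual API or parameterization.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Normalized difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(weight, direction, strength=1.0):
    """Return W - strength * r r^T W.

    weight is a d_out x d_in matrix (list of rows); direction r has length
    d_out. With strength=1.0 the outputs of W have no component along r.
    `strength` is an illustrative knob, assumed here as the kind of value
    an optimizer could tune.
    """
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes into the direction r.
    proj = [sum(direction[i] * weight[i][j] for i in range(d_out))
            for j in range(d_in)]
    return [[weight[i][j] - strength * direction[i] * proj[j]
             for j in range(d_in)]
            for i in range(d_out)]
```

Because the edit is a closed-form change to the weights, no gradient step is involved, which is consistent with the "no retraining" framing in the summary.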

// ANALYSIS

Heretic industrialized what was previously a niche ML research operation: the barrier to abliteration dropped to `pip install heretic-llm && heretic <model>`, and the 1,000+ community models on Hugging Face show that people ran with it.

  • Based on Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction"), Heretic orthogonalizes weight matrices against a computed "refusal direction" for each layer, with no retraining or fine-tuning, and the resulting models stay on Hugging Face
  • Its key technical edge: it jointly minimizes refusal rate and KL divergence from the original model, so ablated models lose less general capability; benchmarks show a KL divergence of 0.16 vs. 0.45–1.04 for competitors on gemma-3-12b-it
  • v1.2 added 4-bit quantization via a LoRA engine, cutting VRAM requirements by 70% and bringing abliteration to consumer hardware
  • The framing matters: calling safety alignment "censorship" and the tool "Heretic" is a deliberate culture-war aesthetic that helped drive viral spread on r/LocalLLaMA and HN (745 points, 380 comments)
  • Raises real questions about the long-term durability of RLHF-based safety: if a single-command tool with 14k stars can strip it out, alignment baked purely into weights is not a meaningful safety boundary
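
The two quantities named in the bullets above, refusal rate and KL divergence from the original model, can be sketched as follows. This is an illustration under stated assumptions, not Heretic's implementation: the keyword-based refusal check, the `markers` list, the direction of the KL term, and the `score` helper are all hypothetical choices made for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def refusal_rate(responses, markers=("i can't", "i cannot", "i'm sorry")):
    """Fraction of responses containing a refusal marker (assumed heuristic)."""
    refused = sum(any(m in r.lower() for m in markers) for r in responses)
    return refused / len(responses)

def score(orig_logits, ablated_logits, responses):
    """Return (refusal rate, mean KL): the pair a multi-objective search
    would try to push toward zero together.

    orig_logits and ablated_logits are parallel lists of next-token logit
    vectors from the original and ablated model on the same prompts.
    """
    kls = [kl_divergence(softmax(o), softmax(a))
           for o, a in zip(orig_logits, ablated_logits)]
    return refusal_rate(responses), sum(kls) / len(kls)
```

A candidate ablation that leaves the output distribution unchanged scores a KL of zero, so a lower mean KL at the same refusal rate indicates less drift from the original model's behavior.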

// TAGS
heretic · llm · open-source · safety · self-hosted · research

DISCOVERED: 86d ago (2026-03-15)
PUBLISHED: 86d ago (2026-03-15)
RELEVANCE: 8/10