GH · GITHUB // 28d ago // OPENSOURCE RELEASE

Heretic automates LLM safety-alignment removal

Heretic is a Python tool that removes safety alignment ("censorship") from transformer-based LLMs in a single command, no ML expertise required. It implements a parameterized variant of directional ablation with automatic hyperparameter optimization, producing uncensored models with significantly less capability degradation than prior art.
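
As a rough illustration of directional ablation in general (not Heretic's actual code), the sketch below takes a refusal direction as the normalized difference between mean activations on two prompt sets and projects it out of a weight matrix. The function names, the toy 2x2 values, and the `weight` scaling parameter are assumptions made for this example; plain Python lists stand in for tensors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(weight_matrix, direction, weight=1.0):
    """Return W - weight * r (r^T W).

    weight_matrix is d_out x d_in (rows index output dimensions) and
    direction is a unit vector of length d_out. `weight` is a hypothetical
    knob: 1.0 removes the component along r entirely, 0.0 leaves W unchanged.
    """
    d_out = len(direction)
    d_in = len(weight_matrix[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [
        sum(direction[i] * weight_matrix[i][j] for i in range(d_out))
        for j in range(d_in)
    ]
    return [
        [weight_matrix[i][j] - weight * direction[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]


# Toy activations (made up for illustration): the two sets differ only
# along the first coordinate, so that is the direction recovered.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[1.0, 0.0], [1.0, 0.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0]]
W_ablated = ablate(W, r)
```

With `weight=1.0` the edited matrix can no longer write anything along the direction; smaller values remove only part of it, which is the kind of knob an automatic hyperparameter search could tune.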

// ANALYSIS

Heretic industrialized what was previously a niche ML research operation: the barrier to abliteration (ablating a model's refusal behavior directly in its weights) dropped to `pip install heretic-llm && heretic <model>`, and the 1,000+ community models on Hugging Face show that people ran with it.

– Based on Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction"), Heretic orthogonalizes weight matrices against a computed "refusal direction" for each layer: no retraining, no fine-tuning, and models stay on Hugging Face
– Its key technical edge: it co-minimizes refusal rate and KL divergence from the original model, so ablated models lose less general capability; benchmarks show a KL divergence of 0.16 vs. 0.45–1.04 for competitors on gemma-3-12b-it
– v1.2 added 4-bit quantization via a LoRA engine, cutting VRAM requirements by 70% and bringing abliteration to consumer hardware
– The framing matters: calling safety alignment "censorship" and naming the tool "Heretic" is a deliberate culture-war aesthetic that likely helped drive its viral spread on r/LocalLLaMA and HN (745 points, 380 comments)
– It raises real questions about the long-term durability of RLHF-based safety: if a single-command tool with 14k stars can strip it out, alignment baked purely into weights is not a meaningful safety boundary
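
The two quantities traded off above, refusal rate and KL divergence, can be sketched as follows. This is a minimal illustration under stated assumptions, not Heretic's implementation: the KL direction (original relative to ablated), the substring-based refusal check and its marker phrases, and the Pareto-style `dominates` comparison are all hypothetical choices, and every number in the usage lines is made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(original_logits, ablated_logits):
    """KL(P_original || P_ablated) in nats for one next-token distribution."""
    p = softmax(original_logits)
    q = softmax(ablated_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical marker phrases; a real refusal detector would need more care.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def refusal_rate(responses):
    """Fraction of responses containing any refusal marker (case-insensitive)."""
    refused = sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses
    )
    return refused / len(responses)


def dominates(a, b):
    """True if candidate a is no worse than b on both objectives and better on one.

    Each candidate is a (refusal_rate, kl_divergence) pair; lower is better.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


# Made-up values for illustration only.
same = kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = kl_divergence([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
rate = refusal_rate(["I'm sorry, but I can't help with that.", "Sure, here you go."])
better = dominates((0.1, 0.2), (0.1, 0.5))
```

A lower KL divergence at the same refusal rate means the ablated model's output distribution stays closer to the original's, which is why the pair is reported together rather than refusal rate alone.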

// TAGS

heretic · llm · open-source · safety · self-hosted · research

DISCOVERED

28d ago

2026-03-15

PUBLISHED

28d ago

2026-03-15

RELEVANCE

8/10