OPEN_SOURCE ↗
GH · GITHUB // 28d ago // OPEN-SOURCE RELEASE
Heretic automates LLM safety-alignment removal
Heretic is a Python tool that removes safety alignment ("censorship") from transformer-based LLMs in a single command, no ML expertise required. It implements a parameterized variant of directional ablation with automatic hyperparameter optimization, producing uncensored models with significantly less capability degradation than prior art.
// ANALYSIS
Heretic industrialized what was previously a niche ML research operation — the barrier to abliteration just dropped to `pip install heretic-llm && heretic <model>`, and 1,000+ community models on Hugging Face prove people ran with it.
- Based on Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction"), Heretic orthogonalizes weight matrices against computed "refusal directions" per layer; no retraining, no fine-tuning, and the result remains a standard Hugging Face checkpoint (see the first sketch after this list)
- Its key technical edge: it co-minimizes refusal rate AND KL divergence from the original model, so ablated models lose less general capability; benchmarks show a KL of 0.16 vs. 0.45–1.04 for competing abliterations of gemma-3-12b-it (see the second sketch after this list)
- v1.2 added 4-bit quantization via a LoRA engine, cutting VRAM requirements by 70% and bringing abliteration to consumer hardware
- The framing matters: calling safety alignment "censorship" and the tool "Heretic" is a deliberate culture-war aesthetic that clearly drove viral spread on r/LocalLLaMA and HN (745 points, 380 comments)
- Raises real questions about the long-term durability of RLHF-based safety: if a single-command tool with 14k stars can strip it out, alignment baked purely into weights is not a meaningful safety boundary
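For intuition on the abliteration step, here is a minimal sketch of the difference-of-means refusal direction and weight orthogonalization described in Arditi et al. 2024, in plain PyTorch. The function names are illustrative, not Heretic's actual API, and Heretic's parameterized variant (per-layer ablation weights chosen by the hyperparameter search) is omitted.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction for one layer (Arditi et al. 2024).

    Both inputs are [n_prompts, d_model] residual-stream activations captured
    at the same layer and token position for harmful vs. harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

@torch.no_grad()
def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a matrix that writes into the
    residual stream (e.g. attention output or MLP down-projection):
    W' = W - r r^T W, so the layer can no longer push activations along r.

    weight: [d_model, d_in], direction: [d_model].
    """
    r = direction / direction.norm()
    return weight - torch.outer(r, r) @ weight
```

Because the edit is baked directly into the weights, the exported model needs no runtime hooks or custom inference code, which is why the output can be uploaded and used as an ordinary checkpoint.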
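And a sketch of the second objective: measuring drift from the original model as KL divergence over next-token distributions on harmless prompts, which an automatic hyperparameter search can minimize alongside refusal rate. The scalar `trial_score` combination and its `kl_weight` parameter are assumptions for illustration; the exact trade-off Heretic optimizes is not reproduced here.

```python
import torch
import torch.nn.functional as F

def kl_from_original(orig_logits: torch.Tensor,
                     ablated_logits: torch.Tensor) -> torch.Tensor:
    """Mean KL(original || ablated) over next-token distributions.

    Both tensors are [n_prompts, vocab_size] logits at the final position of
    a set of harmless prompts; a low value means the ablated model still
    behaves like the original wherever refusals are not involved.
    """
    p_log = F.log_softmax(orig_logits, dim=-1)    # original model, log-probs
    q_log = F.log_softmax(ablated_logits, dim=-1)  # ablated model, log-probs
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")

def trial_score(refusal_rate: float, kl: float, kl_weight: float = 1.0) -> float:
    """Hypothetical scalar a hyperparameter optimizer could minimize per trial:
    fewer refusals AND minimal KL drift from the original weights."""
    return refusal_rate + kl_weight * kl
```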
// TAGS
heretic-llm · open-source · safety · self-hosted · research
DISCOVERED: 2026-03-15 (28d ago)
PUBLISHED: 2026-03-15 (28d ago)
RELEVANCE: 8/10