REDDIT // 1d ago · OPEN-SOURCE RELEASE

OBLITERATUS sharpens MoE-aware refusal removal

OBLITERATUS is a Python and Gradio-based toolkit for analysis-informed “abliteration” of LLM refusal behavior. The project presents itself as a one-click workflow for probing hidden states, extracting refusal directions, applying projection or steering interventions, benchmarking the modified model, and then saving or chatting with the result. It also adds research-oriented features like telemetry-backed aggregation, architecture detection, and explicit MoE support, which makes it feel more like a lab instrument than a simple utility script.
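The core workflow described above — probing hidden states on contrasting prompts, extracting a refusal direction, then projecting it out — can be sketched with the standard difference-of-means recipe used in abliteration work. This is a minimal NumPy illustration of the idea, not OBLITERATUS' actual API; the function names and the use of pre-collected activation matrices are assumptions.

```python
import numpy as np

def refusal_direction(refusal_acts, benign_acts):
    """Difference-of-means refusal direction: mean activation on
    refusal-triggering prompts minus mean activation on benign prompts,
    normalized to unit length. Inputs are (n_prompts, d_model) arrays."""
    d = refusal_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(hidden, direction):
    """Ablate the refusal component: remove each hidden state's projection
    onto `direction`. `hidden` is (n_tokens, d_model), `direction` unit-norm."""
    coeffs = hidden @ direction            # (n_tokens,) projection coefficients
    return hidden - np.outer(coeffs, direction)
```

After `project_out`, every row of the result is orthogonal to the refusal direction, which is the "projection intervention" the post refers to; the steering variant instead adds or subtracts a scaled copy of the same direction.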

// ANALYSIS

Hot take: this looks promising if you want interpretable refusal removal on tricky architectures, but Heretic still reads as the safer default for fully automated runs.

  • OBLITERATUS’ main edge is its MoE-aware framing: it advertises `surgical` mode, active-parameter GPU sizing, and per-expert analysis, so it is better tailored to MoE targets on paper.
  • Heretic looks more mature as a push-button optimizer, with automatic parameter search and explicit support for dense models plus several MoE architectures, so it is probably the lower-risk choice if you want something proven.
  • The Reddit thread itself is thin on real-world validation, so the “used on a real model” question is still mostly anecdotal rather than settled.
  • If you care about analysis tools, visualization, and community dataset building, OBLITERATUS is the more ambitious project; if you care about getting a decensored model quickly, Heretic is likely the simpler bet.
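For the MoE-aware angle, the usual way a refusal direction is baked into weights (rather than applied at runtime) is to orthogonalize each writing matrix against it, repeated per expert. A hedged sketch of that step, assuming column-vector convention (`W @ x` writes into the residual stream) and a hypothetical list of per-expert projection matrices:

```python
import numpy as np

def orthogonalize_weights(W, direction):
    """Remove the refusal-direction component from a weight matrix that
    writes into the residual stream: W' = (I - d d^T) W, so no input x
    can produce output along `direction`. W is (d_model, d_in)."""
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d @ W)

def abliterate_experts(expert_down_projs, direction):
    """Apply the orthogonalization to every expert's down-projection.
    `expert_down_projs` is a hypothetical list of (d_model, d_ff) arrays."""
    return [orthogonalize_weights(W, direction) for W in expert_down_projs]
```

Per-expert treatment matters because in an MoE only a few experts are active per token, so a single averaged edit can miss the experts that actually carry the refusal behavior; whether OBLITERATUS' `surgical` mode works exactly this way is not confirmed by the post.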
// TAGS
llm · abliteration · mechanistic-interpretability · moe · open-source · huggingface · gradio

DISCOVERED: 1d ago (2026-05-02)

PUBLISHED: 1d ago (2026-05-02)

RELEVANCE: 8/10

AUTHOR: AutomaticDriver5882