OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT
Abliterlitics benchmarks GLM-4.7-Flash abliteration methods
Abliterlitics runs a forensic benchmark of four abliteration techniques on GLM-4.7-Flash, a 59B MoE reasoning model with 64 routed experts per layer. All four hit 100% HarmBench ASR, but they differ sharply on reasoning efficiency, empty-response rates, and downstream benchmark drift.
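Abliteration in this family of techniques is usually implemented as directional ablation: estimate a "refusal direction" from activation differences between harmful and harmless prompts, then project that direction out of selected weight matrices. A minimal NumPy sketch, assuming difference-of-means direction extraction; the function names, shapes, and layer selection are illustrative, not taken from the post:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means 'refusal direction', unit-normalized.
    Inputs are (n_prompts, d_model) activation matrices from one layer."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_weight(W, r):
    """Project r out of W's output space: W' = W - r r^T W.
    Afterwards W' @ x has zero component along r for any input x."""
    return W - np.outer(r, r) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                      # toy weight matrix
harmful = rng.normal(size=(16, 8)) + 1.0         # shifted activations
harmless = rng.normal(size=(16, 8))
r = refusal_direction(harmful, harmless)
W_ablated = ablate_weight(W, r)
```

Which matrices get this treatment (MLP outputs, attention outputs, router weights) is exactly where the four benchmarked techniques differ.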
// ANALYSIS
The real story is not that safety disappears, but that different abliteration methods trade off where the damage shows up. Heretic looks like the cleanest cut; broader or router-heavy edits preserve ASR but increasingly distort reasoning efficiency and output reliability.
- Heretic is the best balance here: strongest GSM8K adjusted score, lowest empty rate, and the smallest visible collateral on the rest of the benchmark suite.
- HauhauCS does not look "lossless" in practice; its raw GSM8K drop is mostly an empty-response problem, but the higher empty rate still means worse usability on a reasoning model.
- Abliterix is the most extreme case of preserving underlying reasoning while breaking delivery: adjusted GSM8K stays near base, but half of raw runs come back empty.
- The CoT forensics are the most interesting part: safety reasoning still appears in a large share of outputs even when the final refusal layer is gone, which suggests the edits reroute behavior more than erase it.
- Cross-technique cosine similarity staying low supports the paper's main conclusion: there is no single universal "abliteration subspace," even on the same base model.
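The low cross-technique cosine similarity claim can be checked mechanically: extract each technique's ablated direction from the same layer and compare all pairs. A hedged sketch, assuming each method exposes one direction vector; the `dirs` dict, its keys, and the example vectors are hypothetical:

```python
import numpy as np
from itertools import combinations

def pairwise_cosine(directions):
    """Cosine similarity for every pair of named direction vectors.
    Values near 0 mean the techniques removed nearly orthogonal
    directions, i.e. no shared 'abliteration subspace'."""
    sims = {}
    for (a, u), (b, v) in combinations(sorted(directions.items()), 2):
        sims[(a, b)] = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sims

# Hypothetical per-technique directions for illustration only:
dirs = {
    "heretic":   np.array([1.0, 0.0, 0.0]),
    "abliterix": np.array([0.0, 1.0, 0.0]),
    "hauhaucs":  np.array([1.0, 0.0, 0.0]),
}
sims = pairwise_cosine(dirs)
```

Here `("abliterix", "heretic")` would come out near 0 (orthogonal edits) while `("hauhaucs", "heretic")` would be near 1 (a shared direction); the post's finding is that real techniques mostly land in the first regime.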
// TAGS
abliterlitics · glm-4.7-flash · heretic · benchmark · safety · reasoning · llm
DISCOVERED
3h ago
2026-04-28
PUBLISHED
6h ago
2026-04-28
RELEVANCE
9 / 10
AUTHOR
nathandreamfast