OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT

Abliterlitics benchmarks GLM-4.7-Flash abliteration methods

Abliterlitics runs a forensic benchmark of four abliteration techniques on GLM-4.7-Flash, a 59B MoE reasoning model with 64 routed experts per layer. All four hit 100% HarmBench ASR, but they differ sharply on reasoning efficiency, empty-response rates, and downstream benchmark drift.
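Abliteration techniques in this family generally work by estimating a "refusal direction" in activation space and projecting it out of weights that write into the residual stream. A minimal sketch of that projection step, with illustrative names and shapes not taken from the post:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs along direction r.

    W: (d_out, d_in) weight matrix writing into the residual stream.
    r: (d_out,) estimated refusal direction (normalized internally).
    """
    r = r / np.linalg.norm(r)
    # Orthogonal projection: W' = (I - r r^T) W
    return W - np.outer(r, r) @ W

# Toy check: after ablation, W' contributes nothing along r.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)
print(np.allclose((r / np.linalg.norm(r)) @ W_abl, 0.0))  # True
```

How the direction is estimated, and which matrices (or routers, in a MoE like GLM-4.7-Flash) get edited, is exactly where the four benchmarked methods differ.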

// ANALYSIS

The real story is not that safety disappears, but that different abliteration methods trade off where the damage shows up. Heretic looks like the cleanest cut; broader or router-heavy edits preserve ASR but increasingly distort reasoning efficiency and output reliability.

  • Heretic is the best balance here: strongest GSM8K adjusted score, lowest empty rate, and the smallest visible collateral on the rest of the benchmark suite.
  • HauhauCS does not look “lossless” in practice; its raw GSM8K drop is mostly an empty-response problem, but the higher empty rate still means worse usability on a reasoning model.
  • Abliterix is the most extreme case of preserving underlying reasoning while breaking delivery: adjusted GSM8K stays near base, but half of raw runs go empty.
  • The CoT forensics are the most interesting part: safety reasoning still appears in a large share of outputs even when the final refusal layer is gone, which suggests the edits reroute behavior more than erase it.
  • Cross-technique cosine similarity staying low supports the paper’s main conclusion: there is no single universal “abliteration subspace,” even on the same base model.
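The raw-versus-adjusted GSM8K distinction running through the bullets is just accuracy over all runs versus accuracy over non-empty runs. A quick sketch with made-up counts (not the post's numbers) showing how a high empty rate can mask intact reasoning:

```python
def gsm8k_scores(n_total: int, n_correct: int, n_empty: int) -> tuple[float, float]:
    """Return (raw, adjusted) accuracy.

    raw counts empty responses as failures; adjusted drops them,
    separating reasoning quality from delivery failures.
    """
    raw = n_correct / n_total
    answered = n_total - n_empty
    adjusted = n_correct / answered if answered > 0 else 0.0
    return raw, adjusted

# Hypothetical Abliterix-style run: half the outputs go empty,
# but nearly all answered problems are solved correctly.
raw, adj = gsm8k_scores(n_total=1000, n_correct=400, n_empty=500)
print(f"raw={raw:.2f} adjusted={adj:.2f}")  # raw=0.40 adjusted=0.80
```

Under this framing, Heretic wins on both columns at once, while Abliterix's "near-base adjusted score" coexists with an unusable raw score.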
// TAGS
abliterlitics · glm-4.7-flash · heretic · benchmark · safety · reasoning · llm

DISCOVERED: 3h ago (2026-04-28)

PUBLISHED: 6h ago (2026-04-28)

RELEVANCE: 9/10

AUTHOR: nathandreamfast