Reddit claims low-KL Qwen refusal wipe
REDDIT · 29d ago · BENCHMARK RESULT

A post on Reddit's r/LocalLLaMA claims that a weekend-built method removes refusal behavior from Qwen 3.5 2B, reaching 0/120 refusals within minutes while keeping the 50-token KL divergence low. The author shares partial logs, describes the results as reproducible on consumer and multi-GPU hardware, and says a paper is planned but not yet published.

// ANALYSIS

This is an eye-catching benchmark claim, but it is still unreviewed anecdotal evidence until code, method details, and independent replication are available.

  • The reported tradeoff is unusually strong: a KL of 0.0141, which would indicate nearly unchanged output behavior, alongside complete refusal removal (0/120).
  • If validated, the technique could materially lower the barrier for safety stripping on open models.
  • The lack of a paper or reproducible artifact right now makes this more of an early signal than a confirmed breakthrough.
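
For readers unfamiliar with the metric, the sketch below shows one common way a "50-token KL divergence" between a base model and an edited model could be computed: average the per-position KL between their next-token distributions over up to 50 positions. The post does not say how its figure was produced, so the KL direction, the averaging, and the function names (`softmax`, `kl_divergence`, `mean_token_kl`) are all assumptions made for illustration, and the logits are toy values rather than model outputs.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_token_kl(base_logits, edited_logits, max_tokens=50):
    """Average per-position KL(base || edited) over at most max_tokens positions.

    Assumption: both inputs are lists of per-position logit vectors obtained
    by running the two models on the same token sequence.
    """
    pairs = list(zip(base_logits, edited_logits))[:max_tokens]
    if not pairs:
        raise ValueError("no token positions to compare")
    total = sum(kl_divergence(softmax(b), softmax(e)) for b, e in pairs)
    return total / len(pairs)


# Toy logits for two positions over a three-token vocabulary (not real data).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited = [[2.0, 1.0, 0.1], [0.4, 0.6, 2.9]]
score = mean_token_kl(base, edited)
print(f"mean per-token KL: {score:.6f}")
```

A value near zero means the edited model's next-token distributions closely track the base model's on the evaluated prompts. It says nothing about behavior on prompts outside the evaluation set, which is one reason independent replication matters here.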

// TAGS

qwen3-5-2b · llm · safety · benchmark · open-weights

DISCOVERED

29d ago

2026-03-14

PUBLISHED

29d ago

2026-03-14

RELEVANCE

7/10

AUTHOR

Sliouges