Reddit claims low-KL Qwen refusal wipe
A LocalLLaMA Reddit post claims a weekend method can cut Qwen 3.5 2B to 0/120 refusals in minutes while keeping 50-token KL divergence low. The author shares partial logs, calls the results reproducible on consumer and multi-GPU hardware, and says a paper is planned but not yet published.
This is an eye-catching benchmark claim, but it is still unreviewed anecdotal evidence until code, method details, and independent replication are available.
- The reported tradeoff is unusually strong: near-preserved behavior (KL 0.0141) with complete refusal removal.
- If validated, the technique could materially lower the barrier for safety stripping on open models.
- The lack of a paper or reproducible artifact right now makes this more of an early signal than a confirmed breakthrough.
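For readers unfamiliar with the metric, here is a minimal sketch of one common way a "50-token KL divergence" could be computed: the KL divergence between the base and modified models' next-token distributions, averaged over the first 50 generated positions. The post does not say how its figure was produced, so the direction (base relative to modified), the averaging, and the function names below are all assumptions for illustration, not the author's method.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_token_kl(base_logits, edited_logits, max_tokens=50):
    """Average per-position KL(base || edited) over the first max_tokens positions.

    Each argument is a list of per-position logit vectors over the same
    vocabulary. The 50-position window is an assumption about what the
    post means by "50-token".
    """
    pairs = list(zip(base_logits, edited_logits))[:max_tokens]
    if not pairs:
        raise ValueError("need at least one token position")
    total = sum(kl_divergence(softmax(b), softmax(e)) for b, e in pairs)
    return total / len(pairs)
```

Under this definition, a value near zero means the modified model's token distributions stay close to the base model's on the evaluated prompts. It says nothing about prompts outside that evaluation set.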
DISCOVERED: 2026-03-14
PUBLISHED: 2026-03-14
AUTHOR: Sliouges