Pretraining data curation targets safer alignment
A Reddit discussion asks whether harmful behavior can be filtered or rewritten out of pretraining data before a model ever learns it, instead of patching it later with RLHF-style alignment. The poster claims targeted replacement can preserve coherence while suppressing violence and deception, and says a custom wavelet-based prototype already cuts violent generations on WikiText-103.
This is a real research direction, but it looks more like safety shaping than concept deletion. Only highly separable concepts are plausibly ablatable; for most semantic targets there is no demonstrated zero floor, only partial reductions and leakage through benign context. RealToxicityPrompts and newer safety-pretraining work show that data-centric pretraining can materially reduce toxic outputs without eliminating them, and "A Pretrainer's Guide to Training Data" found a real safety-versus-capability tradeoff as toxicity filters get stricter. "Personas as a Way to Model Truthfulness in Language Models" suggests truthfulness is learned from the structure of the data, so isolated deletions will not fully erase deception. The biggest risk is entanglement: violent and deceptive language is woven through legitimate scientific, code, and historical text, so overly blunt curation quietly dents scientific and algorithmic capability along with the harmful content.
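To make the filter-versus-rewrite distinction concrete, here is a minimal Python sketch of threshold-based curation. It is a hypothetical illustration, not the poster's wavelet prototype: `score_toxicity` is a stand-in for any document-level classifier (a Detoxify- or Perspective-style model, for example), and the threshold values are arbitrary assumptions.

```python
"""Minimal sketch of two-threshold pretraining-data curation.

Hypothetical illustration only: `score_toxicity` stands in for a real
document-level classifier, and the thresholds are arbitrary.
"""
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    text: str
    score: float          # toxicity score assigned by the classifier
    action: str           # "keep" | "rewrite" | "drop"


def curate(
    docs: Iterable[str],
    score_toxicity: Callable[[str], float],  # hypothetical classifier, returns [0, 1]
    drop_above: float = 0.8,     # hard filter: discard outright
    rewrite_above: float = 0.4,  # soft band: flag for targeted replacement
) -> Iterator[Doc]:
    """Route each document to keep / rewrite / drop by toxicity score.

    The two-threshold design reflects the tradeoff noted in "A Pretrainer's
    Guide to Training Data": lowering `drop_above` removes more harmful text
    but also more entangled scientific, code, and historical text, so
    borderline documents are flagged for rewriting rather than discarded.
    """
    for text in docs:
        s = score_toxicity(text)
        if s >= drop_above:
            yield Doc(text, s, "drop")
        elif s >= rewrite_above:
            yield Doc(text, s, "rewrite")
        else:
            yield Doc(text, s, "keep")


if __name__ == "__main__":
    # Toy corpus with hand-assigned scores in place of a real classifier.
    fake_scores = {
        "a benign passage": 0.05,
        "a borderline passage": 0.55,
        "a clearly violent passage": 0.92,
    }
    for doc in curate(fake_scores, fake_scores.__getitem__):
        print(f"{doc.action:7s} score={doc.score:.2f} {doc.text!r}")
```

The middle band is the part that matters for the coherence claim: rewriting borderline documents rather than dropping them is what would, in principle, preserve the surrounding legitimate text that strict filtering discards.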
DISCOVERED: 2026-03-30 (12d ago)
PUBLISHED: 2026-03-29 (13d ago)
AUTHOR: Real_Beach6493