OPEN_SOURCE ↗
REDDIT · 13d ago · RESEARCH PAPER
Llama 1B continued pretraining wipes chat skills
A developer reports that continued pretraining a Llama 1B model on about 9 million characters of raw text turned it into a repetitive output generator instead of a question-answering model. The post is a cautionary report about catastrophic forgetting from pushing raw-text adaptation too far, too fast.
// ANALYSIS
At a learning rate of 2e-4 on a tiny corpus (~9 MB of text), this looks more like overfitting and distribution drift than a broken model. Continued pretraining can add domain knowledge, but it will not preserve chat behavior unless you explicitly protect that behavior during training.
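One cheap way to catch this drift before the model degenerates is to evaluate loss on a held-out slice of general data at each checkpoint and stop when it climbs. A minimal sketch of such a monitor (a hypothetical helper; the tolerance and patience values are illustrative, not from the post):

```python
class DriftMonitor:
    """Track held-out general-data loss during continued pretraining
    and signal a stop when it degrades past a tolerance.

    Illustrative defaults: stop after `patience` consecutive evals
    whose loss exceeds the best seen by more than `tolerance` (5%).
    """

    def __init__(self, tolerance=0.05, patience=2):
        self.tolerance = tolerance
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, general_loss):
        # Remember the best (lowest) general-set loss seen so far.
        if general_loss < self.best:
            self.best = general_loss
        # Count consecutive evals that regress past the tolerance band.
        if general_loss > self.best * (1 + self.tolerance):
            self.bad_evals += 1
        else:
            self.bad_evals = 0
        return self.bad_evals >= self.patience  # True => stop training


monitor = DriftMonitor()
# Example eval losses on a general held-out set, one per checkpoint:
losses = [2.10, 2.08, 2.25, 2.31]
stops = [monitor.update(loss) for loss in losses]
# Training would halt at the fourth checkpoint, once general-domain
# loss has regressed past tolerance twice in a row.
```

In practice the same check is usually wired into the trainer's eval callback rather than run by hand.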
- If you started from an instruct-tuned checkpoint, raw-text CPT likely stripped the alignment layer; the safer pattern is base CPT first, then a fresh instruction-tuning pass.
- Meta’s Llama 3.2 model card says the chat-oriented variants are built with post-training on top of the pretrained base, so raw CPT alone is not supposed to carry assistant behavior.
- Apple’s 2025 work on forgetting says even a small replay mix of pretraining data can help shield the model, and continual-pretraining papers make that general-vs-domain tradeoff explicit.
- The simplest guardrail is a mixture: keep some general/instruction data in the run, clean out markup or boilerplate, and stop early before the model locks onto template noise.
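The replay-mixture guardrail above can be sketched in a few lines: dilute the domain corpus with a small fraction of general/instruction samples before shuffling the training set. The function name and the 15% ratio are assumptions for illustration, not values from the post or the papers cited:

```python
import random


def build_replay_mixture(domain_docs, general_docs, replay_frac=0.15, seed=0):
    """Mix raw domain text with a small 'replay' slice of general or
    instruction data so continued pretraining drifts less.

    replay_frac=0.15 means roughly 15% of the final mixture is general
    data (an assumed ratio; tune per run).
    """
    rng = random.Random(seed)
    # Number of replay samples so that replay_frac of the total is general data:
    # n_replay / (n_domain + n_replay) == replay_frac
    n_replay = max(1, int(len(domain_docs) * replay_frac / (1 - replay_frac)))
    replay = [rng.choice(general_docs) for _ in range(n_replay)]
    mixture = domain_docs + replay
    rng.shuffle(mixture)
    return mixture


domain = [f"domain doc {i}" for i in range(85)]
general = [f"instruction sample {i}" for i in range(1000)]
batch = build_replay_mixture(domain, general, replay_frac=0.15)
# 85 domain docs + 15 replay samples => 100 total, ~15% general data.
```

With a streaming dataset library the same idea is usually expressed as weighted interleaving of two sources rather than an in-memory concatenation.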
// TAGS
llama · llm · fine-tuning · research · open-weights
DISCOVERED
13d ago
2026-03-29
PUBLISHED
13d ago
2026-03-29
RELEVANCE
8 / 10
AUTHOR
SUPRA_1934