Llama 1B continued pretraining wipes chat skills
OPEN_SOURCE
REDDIT · 13d ago · RESEARCH PAPER


A developer reports that continued pretraining a Llama 1B model on about 9 million characters of raw text turned it into a repetitive output generator instead of a question-answering model. The post is a cautionary report about catastrophic forgetting from pushing raw-text adaptation too far, too fast.

// ANALYSIS

At a learning rate of 2e-4 on a tiny corpus (~9M characters), this looks more like overfitting and distribution drift than a broken model. Continued pretraining can add domain knowledge, but it will not preserve chat behavior unless you explicitly protect that behavior during training.
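As a rough illustration of why 2e-4 is aggressive for CPT on a small corpus, compare it against a more conservative setup. This is a hedged sketch: the hyperparameter names and thresholds below are illustrative, not the poster's actual config or any specific trainer's API.

```python
# Hypothetical CPT settings; key names are illustrative, not a real trainer's API.
risky_cpt = {
    "learning_rate": 2e-4,   # the rate from the post: high for CPT on ~9M chars
    "replay_fraction": 0.0,  # no general-data replay -> nothing anchors old behavior
}

conservative_cpt = {
    "learning_rate": 1e-5,      # 10-20x lower is a common CPT starting point
    "warmup_ratio": 0.03,       # gentle ramp-up to avoid an early loss spike
    "replay_fraction": 0.25,    # mix some general/instruction data back in
    "early_stop_patience": 2,   # stop once held-out general loss degrades
}

def drift_risk(cfg: dict) -> str:
    """Crude heuristic: high LR with zero replay is the classic forgetting recipe."""
    if cfg["learning_rate"] >= 1e-4 and cfg.get("replay_fraction", 0.0) == 0.0:
        return "high"
    return "lower"

print(drift_risk(risky_cpt))         # high
print(drift_risk(conservative_cpt))  # lower
```

The heuristic is deliberately crude; the point is that learning rate and replay ratio interact, so lowering one without the other still leaves drift risk.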

  • If you started from an instruct-tuned checkpoint, raw-text CPT likely stripped the alignment layer; the safer pattern is base CPT first, then a fresh instruction-tuning pass.
  • Meta’s Llama 3.2 model card says the chat-oriented variants are built with post-training on top of the pretrained base, so raw CPT alone is not supposed to carry assistant behavior.
  • Apple’s 2025 work on catastrophic forgetting suggests that even a small replay mix of pretraining data can help shield the model, and continual-pretraining papers make that general-vs-domain tradeoff explicit.
  • The simplest guardrail is a mixture: keep some general/instruction data in the run, clean out markup or boilerplate, and stop early before the model locks onto template noise.
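The mixture guardrail above can be sketched as a document-level sampler that blends domain text with a replay slice of general/instruction data. The function name and the 25% default are assumptions chosen for illustration, not values from the post.

```python
import random

def mix_replay(domain_docs, general_docs, replay_fraction=0.25, seed=0):
    """Interleave domain text with replayed general/instruction documents.

    replay_fraction is the share of the final stream drawn from general_docs;
    25% is an illustrative starting point, not a tuned value.
    """
    rng = random.Random(seed)
    # Solve n_general so that n_general / (n_domain + n_general) = replay_fraction.
    n_general = round(len(domain_docs) * replay_fraction / (1 - replay_fraction))
    replay = [rng.choice(general_docs) for _ in range(n_general)]
    mixed = list(domain_docs) + replay
    rng.shuffle(mixed)  # avoid a long domain-only block followed by replay
    return mixed

# Toy demo: 9 domain docs at 25% replay adds 3 general docs to the stream.
domain = [f"domain_{i}" for i in range(9)]
general = ["instruct_a", "instruct_b", "instruct_c"]
stream = mix_replay(domain, general)
print(len(stream))  # 12
```

Shuffling at the document level is the simplest version of this; production pipelines usually mix at the token or batch level instead, but the ratio arithmetic is the same.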
// TAGS
llama · llm · fine-tuning · research · open-weights

DISCOVERED

2026-03-29 · 13d ago

PUBLISHED

2026-03-29 · 13d ago

RELEVANCE

8/10

AUTHOR

SUPRA_1934