OPEN_SOURCE
REDDIT · 1d ago · OPEN-SOURCE RELEASE
Usenet Corpus spans 103B tokens, 1980-2013
This release documents a massive cleaned Usenet archive covering 1980 through 2013, with 103.1B tokens, 408.2M records, and 18,347 newsgroups. The dataset card also explains the sanitization pipeline, language detection, and intended uses for pretraining and historical text research.
// ANALYSIS
This is a serious corpus release, not just a novelty archive: the value is the combination of scale, time span, and documented cleaning, which makes it usable for long-context and historical language work.
- Deduplication, PII redaction, binary filtering, and Message-ID hashing make the corpus much more training-friendly than raw Usenet dumps.
- The 33-year arc is the standout feature: it captures pre-web, peak-web, and post-peak online discourse in one dataset.
- 96.6% English with 100+ other languages, especially in `soc.*`, gives it enough multilingual texture to matter without turning it into a generic multilingual dump.
- The main caveat is provenance and bias: Usenet overrepresents technical, academic, and heavily quoted discussion, so it is best as a domain-adaptation source rather than a clean proxy for general internet language.
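The cleaning steps the card names (exact deduplication, PII redaction, Message-ID hashing) can be sketched with the standard library. This is a minimal illustration, not the release's actual pipeline: the record schema (`message_id`, `body` keys), the email-only notion of PII, and SHA-256 as the hash are all assumptions for the sake of the example.

```python
import hashlib
import re

# Hypothetical record shape; the dataset's real schema is not given in the card.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sanitize(records):
    """Sketch of a Usenet-style cleaning pass: redact email addresses,
    drop exact-duplicate bodies, and replace Message-IDs with one-way hashes."""
    seen = set()
    out = []
    for rec in records:
        body = EMAIL.sub("[EMAIL_REDACTED]", rec["body"])
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate after redaction: skip
        seen.add(key)
        out.append({
            # Message-ID kept only as a hash, per the card's description
            "msg_id": hashlib.sha256(rec["message_id"].encode("utf-8")).hexdigest()[:16],
            "body": body,
        })
    return out

cleaned = sanitize([
    {"message_id": "<a1@host>", "body": "mail me at jane@example.com"},
    {"message_id": "<a2@host>", "body": "mail me at jane@example.com"},
])
print(len(cleaned), cleaned[0]["body"])
# the second record becomes an exact duplicate after redaction and is dropped
```

A production pipeline would go further (near-duplicate detection, binary/attachment filtering, quoted-text handling), but the shape is the same: normalize, hash, filter.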
// TAGS
dataset · training · research · open-source · llm · usenet-corpus-1980-2013
DISCOVERED
2026-05-01
PUBLISHED
2026-05-01
RELEVANCE
8/10
AUTHOR
OwnerByDane