OPEN_SOURCE ↗
REDDIT // 3h ago · OPEN_SOURCE RELEASE
100k CoT Email Dataset Lands on Hugging Face
Kamisori-daijin's email-datasets-v2-100k is a Hugging Face text-generation dataset with about 99.3k English JSON samples for email-style supervised fine-tuning. The dataset uses a prompt format with explicit <think> reasoning traces followed by a <generate> response, and the card says it was created with Gemma 3-4B-it.
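The card does not spell out the column names, so the following is a minimal sketch of the `<think>`/`<generate>` template only; the `prompt`, `reasoning`, and `response` field names are hypothetical stand-ins, not confirmed fields of the dataset.

```python
# Sketch of the <think>/<generate> SFT template described on the card.
# Field names (prompt/reasoning/response) are assumptions; check the
# actual dataset card for the real column names.
def build_sample(prompt: str, reasoning: str, response: str) -> str:
    """Assemble one training string with an explicit reasoning trace."""
    return (
        f"{prompt}\n"
        f"<think>\n{reasoning}\n</think>\n"
        f"<generate>\n{response}\n</generate>"
    )

sample = build_sample(
    "Write a short follow-up email after a sales call.",
    "The sender wants to stay polite, reference the call, and propose a next step.",
    "Hi Sam,\n\nThanks for your time today...",
)
```

The point of the template is that the reasoning span is delimited, so a trainer can either supervise it or strip it before computing loss.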
// ANALYSIS
The interesting part is not just the size, but the training signal: it gives the model visible reasoning scaffolding instead of answer-only supervision. That can help a small local model learn a more consistent response structure, but the dataset also looks narrowly templated, so it may teach style imitation more than robust reasoning.
- Strong fit if your goal is controlled SFT on email-like outputs with explicit reasoning traces.
- The Hugging Face card shows `99.3k` rows, so the “100k” label is approximate rather than exact.
- The Reddit thread raises a real risk: limited prompt diversity can encourage overfitting to template patterns and plausible-sounding fabrication.
- The main experimental question is whether full CoT traces improve reasoning consistency or just make the model better at reproducing a reasoning format.
- Apache-2.0 is straightforward for reuse, but the dataset card also notes Gemma-generated content terms, so downstream use should respect those constraints.
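One way to probe the full-trace-vs-format question is an ablation: train once with the reasoning span supervised and once with it loss-masked, on identical samples. A toy sketch of the masking side, assuming the `-100` ignore index common in SFT trainers and a whitespace tokenizer as a stand-in (neither is the dataset's actual pipeline):

```python
# Toy illustration: answer-only supervision masks the <think> span with -100
# (the ignore index used by common SFT trainers), so loss is computed only
# on the <generate> response. Tokenization here is a whitespace stand-in.
def make_labels(tokens: list[str], supervise_reasoning: bool) -> list[int]:
    labels, in_think = [], False
    for i, tok in enumerate(tokens):
        if tok == "<think>":
            in_think = True
        masked = in_think and not supervise_reasoning
        labels.append(-100 if masked else i)  # i stands in for a token id
        if tok == "</think>":
            in_think = False
    return labels

tokens = "<think> check tone </think> <generate> Hi Sam </generate>".split()
full = make_labels(tokens, supervise_reasoning=True)
answer_only = make_labels(tokens, supervise_reasoning=False)
```

Comparing the two runs on held-out reasoning tasks would show whether the traces carry transferable signal or just template style.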
// TAGS
datasets · chain-of-thought · cot · fine-tuning · local-llm · reasoning · hugging-face · email
DISCOVERED
3h ago
2026-04-17
PUBLISHED
18h ago
2026-04-16
RELEVANCE
7/10
AUTHOR
AdhesivenessSea9511