Smolcluster GRPO favors staged curricula
A side-project blog reports GRPO experiments on sub-500M models for 64-token Reddit summarization, trained on a cluster of three M4 Mac minis with MLX and distributed vLLM rollouts. The staged curriculum, where length is learned first and quality second, outperformed joint length-plus-quality training on both Qwen2.5-0.5B-Instruct and LFM-2.5-350M.
This reads less like a model release and more like a useful lesson in reward design: for tiny summarizers, the order of objectives matters more than stacking every signal at once.
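To make the staged-versus-joint distinction concrete, here is a minimal sketch of how the two reward schedules could differ. Everything in it is an assumption for illustration: the blog's actual reward shapes, weights, and switch point are not given here, and `length_reward`, `staged_reward`, `joint_reward`, and `switch_step` are hypothetical names.

```python
def length_reward(num_tokens: int, target_len: int = 64) -> float:
    """Hypothetical length reward: 1.0 at the target length, falling
    linearly to 0.0 as the token count moves away from it."""
    return max(0.0, 1.0 - abs(num_tokens - target_len) / target_len)


def staged_reward(num_tokens: int, quality: float, step: int,
                  switch_step: int, target_len: int = 64) -> float:
    """Staged curriculum (assumed form): optimize length only until
    `switch_step`, then optimize the quality score only."""
    if step < switch_step:
        return length_reward(num_tokens, target_len)
    return quality


def joint_reward(num_tokens: int, quality: float, target_len: int = 64,
                 w_len: float = 0.5, w_quality: float = 0.5) -> float:
    """Joint baseline (assumed form): a fixed weighted sum of both
    signals at every training step."""
    return w_len * length_reward(num_tokens, target_len) + w_quality * quality
```

With this shape, a 32-token summary scores 0.5 on length in stage one regardless of its quality, while the joint baseline always blends the two signals. The weights and the linear decay are placeholders, not values from the post.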
- Staged training beat joint training across both base models, which suggests the length constraint is doing real optimization work rather than acting as a cosmetic prompt rule
- METEOR plus ROUGE-L emerged as the most reliable reward mix; BLEU alone was not a strong standalone signal for this summarization task
- The failure mode is familiar: unconstrained quality rewards drift into a coverage-versus-conciseness tradeoff, and the 64-token cap acts like a regularizer
- The infra is the other noteworthy part: MLX on Apple Silicon plus asynchronous remote rollouts via vLLM is a practical pattern for small teams without a GPU cluster
- Full bf16 parameters, frozen ref model overhead, and memory-tight training make this a good reference for what is barely feasible on consumer hardware
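As a rough illustration of how a lexical-overlap reward feeds into GRPO, the sketch below scores each sampled summary with an LCS-based ROUGE-L F1 and then normalizes the rewards within the group, which is the group-relative advantage GRPO is named for. It is not the blog's code: whitespace tokenization, the F1 variant of ROUGE-L, the population standard deviation, and the `eps` value are assumptions, and METEOR is left out because it needs stemming and synonym resources.

```python
from statistics import mean, pstdev


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS precision and recall (whitespace tokens)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: each reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score a group of sampled summaries against one reference.
reference = "the cat sat on the mat"
samples = ["the cat sat on the mat", "the cat on mat", "dogs bark loudly"]
rewards = [rouge_l_f1(s, reference) for s in samples]
advantages = group_advantages(rewards)
```

Because the advantages are relative to the group, a summary is pushed up only when it beats its siblings sampled from the same prompt, which is why the choice of reward mix matters so much at this scale.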
DISCOVERED
2026-05-26
PUBLISHED
2026-05-26
AUTHOR
East-Muffin-6472