Smolcluster GRPO favors staged curricula

// 45d ago · BENCHMARK RESULT

A side-project blog post reports GRPO (Group Relative Policy Optimization) experiments on sub-500M-parameter models for 64-token Reddit summarization, trained on a cluster of three Mac mini M4 machines with MLX and distributed vLLM rollouts. The staged curriculum, which trains for length first and quality second, outperformed joint length-plus-quality training on both Qwen2.5-0.5B-Instruct and LFM-2.5-350M.
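
The difference between the two schedules can be sketched in a few lines of Python. This is a minimal illustration, not the blog's code: the triangular shape of `length_reward`, the whitespace token count, the `switch_step` parameter, and the equal joint weighting are all assumptions, and `quality_fn` is a hypothetical placeholder for whatever quality metric is used. The summary also does not say whether a length term is kept during the second stage; this sketch drops it.

```python
from typing import Callable

TARGET_TOKENS = 64  # token budget named in the summary

QualityFn = Callable[[str, str], float]


def length_reward(summary: str, target: int = TARGET_TOKENS) -> float:
    """Assumed shape: 1.0 at exactly `target` tokens, falling linearly to 0.0.

    Whitespace splitting stands in for the model tokenizer here.
    """
    n_tokens = len(summary.split())
    return max(0.0, 1.0 - abs(n_tokens - target) / target)


def staged_reward(summary: str, reference: str, step: int,
                  switch_step: int, quality_fn: QualityFn) -> float:
    """Stage 1 (step < switch_step): length only. Stage 2: quality only."""
    if step < switch_step:
        return length_reward(summary)
    return quality_fn(summary, reference)


def joint_reward(summary: str, reference: str, quality_fn: QualityFn,
                 length_weight: float = 0.5) -> float:
    """Baseline: both signals mixed into one scalar from the first step."""
    return (length_weight * length_reward(summary)
            + (1.0 - length_weight) * quality_fn(summary, reference))
```

In the staged version the policy only ever sees one objective at a time, so the early gradient signal is not diluted by a quality score that a model of this size may not yet be able to move.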

// ANALYSIS

This reads less like a model release and more like a useful lesson in reward design: for tiny summarizers, the order of objectives matters more than stacking every signal at once.

- Staged training beat joint training on both base models, which suggests the length constraint does real optimization work rather than acting as a cosmetic prompt rule
- METEOR plus ROUGE-L emerged as the most reliable reward mix; BLEU was not a strong standalone signal for this summarization task
- The failure mode is familiar: unconstrained quality rewards drift into a coverage-versus-conciseness tradeoff, and the 64-token cap acts like a regularizer
- The infrastructure is the other noteworthy part: MLX on Apple Silicon plus asynchronous remote rollouts via vLLM is a practical pattern for small teams without a GPU cluster
- Full-parameter bf16 training, the overhead of a frozen reference model, and tight memory budgets make this a good reference for what is barely feasible on consumer hardware
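
To make the reward discussion concrete, here is a hedged sketch of a ROUGE-L reward feeding GRPO's group-relative advantages. It is not the blog's implementation: ROUGE-L is computed from scratch over lowercased whitespace tokens (an assumption), METEOR is omitted because it needs external resources, and the function names are illustrative. The advantage step uses the usual GRPO normalization of each reward against the mean and standard deviation of its rollout group.

```python
import statistics
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (tokenization assumed)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward within its group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def score_group(rollouts: List[str], reference: str) -> List[float]:
    """Score every rollout for one prompt, then convert to advantages."""
    return group_advantages([rouge_l_f1(r, reference) for r in rollouts])
```

Because advantages are relative to the group, a reward that barely varies across rollouts (as a weak standalone metric might) yields little useful signal, which is one plausible reading of why the choice of reward mix matters so much at this scale.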

// TAGS

smolcluster, llm, small-llm, fine-tuning, training, training-infra, evaluation, benchmark

DISCOVERED: 2026-05-26 (45d ago)

PUBLISHED: 2026-05-26 (45d ago)

RELEVANCE: 8/10

AUTHOR: East-Muffin-6472
