REDDIT // 5h ago · BENCHMARK RESULT

Smolcluster GRPO tests 64-token summaries

The project is training tiny LFM2.5-350M and Qwen2.5-0.5B-Instruct models on Reddit summarization with GRPO across a 3x Mac mini cluster. The latest update shifts toward comparing length-penalty-only training with quality-aware rewards after earlier evals showed weak BLEU and ROUGE-L under the strict 64-token constraint.
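A minimal sketch of the two reward shapes under comparison, assuming a per-sample scalar reward as GRPO expects; the function names, the linear length term, and the 0.5 blend weight are illustrative assumptions, not the project's actual code.

```python
# Hypothetical reward shapes for the comparison described above; the linear
# length penalty and the alpha blend are assumptions, not the project's code.

def length_only_reward(summary_tokens: list[str], target_len: int = 64) -> float:
    """Reward that only cares about hitting the 64-token target."""
    deviation = abs(len(summary_tokens) - target_len)
    return max(0.0, 1.0 - deviation / target_len)  # 1.0 at exactly 64 tokens

def quality_aware_reward(
    summary_tokens: list[str],
    judge_score: float,       # faithfulness/clarity score in [0, 1] from a judge
    target_len: int = 64,
    alpha: float = 0.5,       # weight on the length term vs. the quality term
) -> float:
    """Blend the length term with a judge-assigned quality score."""
    length_term = max(0.0, 1.0 - abs(len(summary_tokens) - target_len) / target_len)
    return alpha * length_term + (1.0 - alpha) * judge_score
```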

// ANALYSIS

The core issue is reward mismatch: forcing exactly 64 tokens serves the summarization task, but it fights overlap metrics such as BLEU and ROUGE-L that penalize brevity, so the baseline can score worse than it actually performs.
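To make that collision concrete, BLEU's brevity penalty alone drags down any candidate shorter than its reference, before n-gram overlap is even counted; the 96-token reference length in the sketch below is an illustrative number, not a dataset statistic.

```python
import math

def bleu_brevity_penalty(candidate_len: int, reference_len: int) -> float:
    # BLEU multiplies n-gram precision by exp(1 - r/c) whenever the candidate
    # is shorter than the reference, so a hard 64-token cap lowers the score
    # even when every emitted n-gram matches the reference.
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

print(bleu_brevity_penalty(64, 96))  # ~0.61: over a third of the score lost to length alone
```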

  • DeepEval plus a GPT-5 judge is the right call here because faithfulness and clarity matter more than n-gram overlap for summary quality; see the sketch after this list.
  • The 3x Mac mini plus MLX/vLLM-metal setup is a credible low-cost RL lab for small-model experimentation, not just a hardware stunt.
  • If the next SFT/DPO run beats GRPO, that would suggest the optimization problem is simpler than the reward design implies.
  • The most interesting result will be whether length-conditioned supervision can hold the 64-token target without the metric collision seen in BLEU and ROUGE-L.
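A minimal sketch of what the DeepEval-plus-judge evaluation could look like, assuming the library's GEval metric; the criteria wording, the placeholder strings, and the use of "gpt-5" as the judge model name are assumptions about the project's setup, not its actual configuration.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Judge-based metric scoring faithfulness and clarity instead of n-gram overlap.
summary_quality = GEval(
    name="Faithfulness and clarity",
    criteria=(
        "Is the summary faithful to the Reddit post and clearly written "
        "within roughly 64 tokens?"
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-5",  # judge named in the post; any DeepEval-supported judge works
)

case = LLMTestCase(
    input="<full Reddit post text>",           # placeholder source post
    actual_output="<64-token model summary>",  # placeholder model output
)
summary_quality.measure(case)
print(summary_quality.score, summary_quality.reason)
```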
// TAGS
llm · small-llm · training · evaluation · fine-tuning · training-infra · smolcluster

DISCOVERED: 2026-05-05 (5h ago)

PUBLISHED: 2026-05-05 (9h ago)

RELEVANCE: 8/10

AUTHOR: East-Muffin-6472