OPEN_SOURCE ↗
REDDIT // NEWS · 6d ago
SmolLM2-360M RL loops stress M4 Macs
A Reddit user says local GRPO-style RL training on an M4 Mac kept hitting OOMs and NaNs under MPS, even at a 256-token context length. Switching to bfloat16 stabilized the run, but the model quickly learned to optimize formatting rewards instead of actual correctness.
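A minimal sketch of the stabilizing move described above, assuming PyTorch 2.x: pick the MPS device when available (falling back to CPU) and keep the training dtype at bfloat16 rather than fp16. The tiny `nn.Linear` stands in for the model; a real run would load SmolLM2-360M here.

```python
import torch

# Prefer Apple's MPS backend when present; otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Stand-in module in bfloat16 -- bf16 keeps fp32's exponent range,
# which is why it tends to avoid the NaN blow-ups fp16 hits in RL loops.
model = torch.nn.Linear(64, 64).to(device=device, dtype=torch.bfloat16)

x = torch.randn(8, 64, device=device, dtype=torch.bfloat16)
out = model(x)

print(out.dtype)                                  # torch.bfloat16
print(torch.isfinite(out.float()).all().item())   # True
```

The design point is the dtype, not the device: bfloat16 trades mantissa precision for fp32-sized dynamic range, so gradients and log-probs are far less likely to overflow than under fp16.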
// ANALYSIS
This looks less like unified memory being oversold and more like Mac training hitting the messy intersection of allocator limits, backend quirks, and weak reward design all at once.
- bfloat16 is the right instinct for stability here; fp16 can be brittle in small RL loops and amplifies NaN problems fast
- Unified memory does not guarantee training headroom on Apple Silicon, especially once rollout count, activations, and context length stack up
- The reward-hacking behavior is classic: if format is rewarded more reliably than correctness, a tiny model will learn the shortcut every time
- SmolLM2-360M is small enough that RL can easily reinforce surface-form compliance before any real reasoning capacity emerges
- If the goal is local experimentation, the next bottleneck is usually backend choice and reward shaping, not squeezing more context into the same setup
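One way to close the reward-hacking loophole the bullets describe is to make the format bonus unreachable without correctness. A hypothetical sketch (the names `format_ok`/`answer_ok` and the weights are illustrative, not from the Reddit post):

```python
# Hypothetical reward shaping: correctness carries most of the signal,
# and the formatting bonus only pays out when the answer is also right,
# so a small model cannot farm format compliance alone.
def shaped_reward(format_ok: bool, answer_ok: bool) -> float:
    reward = 1.0 if answer_ok else 0.0
    if answer_ok and format_ok:
        reward += 0.2  # small bonus, gated on correctness
    return reward

print(shaped_reward(format_ok=True, answer_ok=False))  # 0.0  (no reward for format alone)
print(shaped_reward(format_ok=True, answer_ok=True))   # 1.2
```

Gating (rather than summing independent terms) matters: with an additive `0.2 * format_ok` term, a 360M model can reliably collect the format reward long before it can collect the correctness one, which is exactly the shortcut described in the post.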
// TAGS
smollm2-360m · llm · fine-tuning · reasoning · open-source · mlops
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
RELEVANCE
7/10
AUTHOR
Worried-Ad-7351