TraceML brings live PyTorch training visibility
REDDIT · 35d ago · OPEN-SOURCE RELEASE


TraceML is an open-source PyTorch observability tool from TraceOpt that wraps the training step in a simple context manager and surfaces live timing, memory, dataloader, and DDP skew signals while a run is still in progress. It aims to fill the gap between heavyweight profilers and generic dashboards, supporting single-GPU runs, single-node DDP, Hugging Face Trainer, and PyTorch Lightning.

// ANALYSIS

This is the kind of ML infra utility teams often build badly in-house, so a lightweight open-source version has real value. TraceML’s pitch is strong because it focuses on the question practitioners actually ask mid-run: why is training slower or less stable than it should be?

  • The core UX is excellent: `with trace_step(model):` is a much easier sell than asking researchers to stop and open a full profiler
  • Step-level breakdowns for dataloader, forward, backward, optimizer, and memory hit a practical debugging sweet spot for day-to-day training work
  • Median-vs-worst-rank and skew views are especially useful for catching DDP stragglers before they become a bigger cluster efficiency problem
  • The project is still early and explicitly not a replacement for PyTorch Profiler or Nsight, with multi-node and FSDP support still on the roadmap
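The appeal of the `with trace_step(model):` UX is that per-phase timing falls out of ordinary training-loop structure. The sketch below is illustrative only, not TraceML's actual implementation or API: it uses nothing but the standard library to show how a context manager can time the dataloader/forward/backward/optimizer phases of one step, the breakdown described above.

```python
import time
from contextlib import contextmanager

# Illustrative sketch only: TraceML's real trace_step hooks into PyTorch
# internals. This stdlib-only stand-in just shows the context-manager
# pattern of timing named phases inside a single training step.

class StepTrace:
    def __init__(self):
        self.timings = {}  # phase name -> elapsed seconds

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - start

@contextmanager
def trace_step():
    trace = StepTrace()
    start = time.perf_counter()
    try:
        yield trace
    finally:
        trace.timings["step_total"] = time.perf_counter() - start

# Usage: wrap one training step and time its phases.
with trace_step() as trace:
    with trace.phase("dataloader"):
        time.sleep(0.01)   # stand-in for fetching a batch
    with trace.phase("forward"):
        time.sleep(0.01)   # stand-in for the forward pass
    with trace.phase("backward"):
        time.sleep(0.01)   # stand-in for loss.backward()
    with trace.phase("optimizer"):
        time.sleep(0.01)   # stand-in for optimizer.step()

for name, secs in trace.timings.items():
    print(f"{name}: {secs * 1000:.1f} ms")
```

In a real tool the same pattern extends naturally to memory snapshots and, under DDP, to gathering each rank's timings so median-vs-worst-rank skew can be computed per phase.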
// TAGS
traceml · open-source · mlops · gpu · data-tools

DISCOVERED

2026-03-07 (35d ago)

PUBLISHED

2026-03-07 (35d ago)

RELEVANCE

8/10

AUTHOR

traceml-ai