TraceML brings live PyTorch training visibility
REDDIT · 35d ago · OPEN-SOURCE RELEASE


TraceML is an open-source PyTorch observability tool from TraceOpt that wraps the training step in a simple context manager and surfaces live timing, memory, dataloader, and DDP skew signals while a run is still in progress. It aims to fill the gap between heavyweight profilers and generic dashboards, supporting single-GPU runs, single-node DDP, Hugging Face Trainer, and PyTorch Lightning.

// ANALYSIS

This is the kind of ML infra utility teams often build badly in-house, so a lightweight open-source version has real value. TraceML’s pitch is strong because it focuses on the question practitioners actually ask mid-run: why is training slower or less stable than it should be?

  • The core UX is excellent: `with trace_step(model):` is a much easier sell than asking researchers to stop and open a full profiler
  • Step-level breakdowns for dataloader, forward, backward, optimizer, and memory hit a practical debugging sweet spot for day-to-day training work
  • Median-vs-worst-rank and skew views are especially useful for catching DDP stragglers before they become a bigger cluster efficiency problem
  • The project is still early and explicitly not a replacement for PyTorch Profiler or Nsight, with multi-node and FSDP support still on the roadmap
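The appeal of the `with trace_step(model):` UX is that per-phase timing falls out of ordinary training-loop structure. The sketch below is illustrative only, not TraceML's actual implementation or API: it uses nothing but the standard library to show how a context manager can time the dataloader/forward/backward/optimizer phases of one step, the breakdown described above.

```python
import time
from contextlib import contextmanager

# Illustrative sketch only: TraceML's real trace_step hooks into PyTorch
# internals. This stdlib-only stand-in just shows the context-manager
# pattern of timing named phases inside a single training step.

class StepTrace:
    def __init__(self):
        self.timings = {}  # phase name -> elapsed seconds

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - start

@contextmanager
def trace_step():
    trace = StepTrace()
    start = time.perf_counter()
    try:
        yield trace
    finally:
        trace.timings["step_total"] = time.perf_counter() - start

# Usage: wrap one training step and time its phases.
with trace_step() as trace:
    with trace.phase("dataloader"):
        time.sleep(0.01)   # stand-in for fetching a batch
    with trace.phase("forward"):
        time.sleep(0.01)   # stand-in for the forward pass
    with trace.phase("backward"):
        time.sleep(0.01)   # stand-in for loss.backward()
    with trace.phase("optimizer"):
        time.sleep(0.01)   # stand-in for optimizer.step()

for name, secs in trace.timings.items():
    print(f"{name}: {secs * 1000:.1f} ms")
```

In a real tool the same pattern extends naturally to memory snapshots and, under DDP, to gathering each rank's timings so median-vs-worst-rank skew can be computed per phase.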
// TAGS
traceml · open-source · mlops · gpu · data-tools

DISCOVERED

2026-03-07 (35d ago)

PUBLISHED

2026-03-07 (35d ago)

RELEVANCE

8/10

AUTHOR

traceml-ai