OPEN_SOURCE
REDDIT · 35d ago · OPEN-SOURCE RELEASE
TraceML brings live PyTorch training visibility
TraceML is an open-source PyTorch observability tool from TraceOpt that wraps the training step in a simple context manager and surfaces live timing, memory, dataloader, and DDP skew signals while a run is still in progress. It targets the gap between heavyweight profilers and generic dashboards with support for single-GPU runs, single-node DDP, Hugging Face Trainer, and PyTorch Lightning.
// ANALYSIS
This is the kind of ML infra utility teams often build badly in-house, so a lightweight open-source version has real value. TraceML’s pitch is strong because it focuses on the question practitioners actually ask mid-run: why is training slower or less stable than it should be?
- The core UX is excellent: `with trace_step(model):` is a much easier sell than asking researchers to stop and open a full profiler
- Step-level breakdowns for dataloader, forward, backward, optimizer, and memory hit a practical debugging sweet spot for day-to-day training work
- Median-vs-worst-rank and skew views are especially useful for catching DDP stragglers before they become a bigger cluster efficiency problem
- The project is still early and explicitly not a replacement for PyTorch Profiler or Nsight, with multi-node and FSDP support still on the roadmap
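The step-phase breakdown described above can be illustrated with a minimal, dependency-free sketch. Note the `StepTracer` class and its `phase` method are hypothetical names invented here to show the pattern; TraceML's actual entry point, per the release, is the `with trace_step(model):` context manager, and its internals are not reproduced here.

```python
import time
from contextlib import contextmanager


class StepTracer:
    """Illustrative tracer: accumulates wall-clock time per labeled phase
    of a training step (not TraceML's real implementation)."""

    def __init__(self):
        self.timings = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.timings[name] = self.timings.get(name, 0.0) + elapsed


tracer = StepTracer()

# Simulated training step: each phase is timed independently,
# mirroring the dataloader/forward/backward/optimizer breakdown.
with tracer.phase("dataloader"):
    time.sleep(0.01)
with tracer.phase("forward"):
    time.sleep(0.02)
with tracer.phase("backward"):
    time.sleep(0.02)
with tracer.phase("optimizer"):
    time.sleep(0.005)

print({k: round(v, 3) for k, v in tracer.timings.items()})
```

Keeping the instrumentation at this granularity is what lets a live view answer "which phase got slower?" without the overhead of a full kernel-level profiler.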
// TAGS
traceml · open-source · mlops · gpu · data-tools
DISCOVERED
2026-03-07
PUBLISHED
2026-03-07
RELEVANCE
8/10
AUTHOR
traceml-ai