lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-16 00:59:46 +00:00

Author	SHA1	Message	Date
Pepijn	1842100402	feat(profiling): record forward/backward/optimizer timings The dashboard expects per-phase timings (forward_s, backward_s, optimizer_s) in step_timing_summary.json, but only total_update_s and dataloading_s were collected — leaving every chart except dataloading empty. Add a lightweight TrainingProfiler.section(name) context manager that times a region with torch.cuda.synchronize before and after (so GPU work is captured, not just the kernel-launch latency) and accumulates per-section samples into step_timing_summary.json. Wrap forward, backward (incl. grad clip), and optimizer (incl. zero_grad and scheduler.step) in update_policy with these sections. When profiling is off (profiler=None) the wrappers become no-ops, so training performance is unchanged outside CI. Made-with: Cursor	2026-04-16 20:26:27 +02:00
Pepijn	b1e16783de	refactor: extract profiling into self-contained TrainingProfiler class Move all profiling orchestration out of lerobot_train.py and TrainPipelineConfig into a TrainingProfiler class in profiling_utils.py. - lerobot_train.py: ~74 lines of profiling code reduced to ~7 call sites - TrainPipelineConfig: 10 profile_* fields reduced to 2 (mode + output_dir) - update_policy: reverted to clean main-branch signature (no timing_collector) - TrainingProfiler encapsulates torch profiler, timing collection, deterministic forward artifacts, and all output writing - CI script (run_model_profiling.py) unchanged—it only passes the 2 kept fields Made-with: Cursor	2026-04-16 16:00:49 +02:00
Pepijn	dbe01b0444	fix(profiling): fix pi0 cuBLAS error and pi05 OOM on 22GB GPU - Move cudnn_deterministic to per-spec train_args instead of hardcoding it for all models. cuBLAS deterministic mode triggers internal errors on Gemma-based models (pi0, pi05) during backward pass. - Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce memory footprint from fp32 (~16GB weights alone) to bf16, fitting within 22GB GPU budget with room for activations and gradients. - Small models (act, diffusion, multi_task_dit) still use deterministic mode for reproducible profiling results. Made-with: Cursor	2026-04-16 15:34:17 +02:00
Pepijn	e16a95a78e	refactor(profiling): remove cProfile, keep torch profiler only Remove cProfile wrapping from the training loop and profiling utilities. The torch profiler already captures fine-grained timing and operator breakdowns; cProfile added redundant overhead without actionable insight for GPU-bound models. - Remove render_cprofile_summary, run_with_cprofile from profiling_utils - Replace cProfile-wrapped calls in lerobot_train with direct calls - Remove cprofile_summaries from artifact index in run_model_profiling - Update tests to match Made-with: Cursor	2026-04-16 15:34:17 +02:00
Pepijn	4137b5785d	fix(profiling): align libero smoke specs with pretrained policies	2026-04-16 15:11:54 +02:00
Pepijn	6d1a5fca02	fix(profiling): keep ci green when hub publish is unauthorized	2026-04-16 13:07:30 +02:00
Pepijn	8d7099cd7d	fix(profiling): publish preview runs via hf dataset prs	2026-04-16 12:50:57 +02:00
Pepijn	40470648d1	feat(profiling): publish preview runs for dashboard debugging	2026-04-16 10:54:34 +02:00
Pepijn	25e5062b2c	fix(profiling): read generic device timings from profiler	2026-04-16 10:29:01 +02:00
Pepijn	35e3b28da1	fix(profiling): normalize timing metrics before export	2026-04-16 10:11:14 +02:00
Pepijn	ed8a98dda6	fix(profiling): preserve policy mode for deterministic forward	2026-04-16 09:50:29 +02:00
Pepijn	1a2aec1b04	feat(profiling): add weekly model profiling	2026-04-15 22:31:44 +02:00

12 Commits
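
Commit 1842100402 describes a `TrainingProfiler.section(name)` context manager that synchronizes the device before and after a region, accumulates per-section samples, and becomes a no-op when `profiler=None`. The sketch below illustrates that pattern only and is not the lerobot implementation. The names `SectionTimer`, `maybe_section`, and `summary`, and the layout of the summary dictionary, are hypothetical. The `sync` argument is an assumption made so the example runs without a GPU; on a CUDA machine one would presumably pass `torch.cuda.synchronize`.

```python
import contextlib
import json
import time
from collections import defaultdict


class SectionTimer:
    """Hypothetical stand-in for the profiler described in the commit message."""

    def __init__(self, sync=None):
        # `sync` is any callable that blocks until queued device work is done
        # (assumption: torch.cuda.synchronize on CUDA). None means CPU-only timing.
        self._sync = sync
        self.samples = defaultdict(list)

    @contextlib.contextmanager
    def section(self, name):
        # Synchronize first so earlier queued work is not billed to this region.
        if self._sync is not None:
            self._sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            # Synchronize again so the measurement covers the device work itself,
            # not just the time needed to launch it.
            if self._sync is not None:
                self._sync()
            self.samples[f"{name}_s"].append(time.perf_counter() - start)

    def summary(self):
        # Hypothetical summary layout: one entry per section key such as "forward_s".
        return {
            key: {
                "count": len(values),
                "total_s": sum(values),
                "mean_s": sum(values) / len(values),
            }
            for key, values in self.samples.items()
        }


def maybe_section(profiler, name):
    # With profiling off (profiler is None) the wrapper does nothing.
    if profiler is None:
        return contextlib.nullcontext()
    return profiler.section(name)


if __name__ == "__main__":
    timer = SectionTimer()  # CPU-only: no sync callable
    for _ in range(3):
        with maybe_section(timer, "forward"):
            sum(i * i for i in range(10_000))  # placeholder for the forward pass
        with maybe_section(timer, "backward"):
            sum(i * i for i in range(20_000))  # placeholder for backward + grad clip
    print(json.dumps(timer.summary(), indent=2))
```

Routing every call through `maybe_section` keeps the training step identical whether or not a profiler is attached, which matches the commit's statement that the wrappers are no-ops when profiling is off.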