Commit Graph

9 Commits

Author SHA1 Message Date
Pepijn dbe01b0444 fix(profiling): fix pi0 cuBLAS error and pi05 OOM on 22GB GPU
- Move cudnn_deterministic to per-spec train_args instead of hardcoding
  it for all models. cuBLAS deterministic mode triggers internal errors
  on Gemma-based models (pi0, pi05) during backward pass.
- Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce memory
  footprint from fp32 (~16GB weights alone) to bf16, fitting within
  22GB GPU budget with room for activations and gradients.
- Small models (act, diffusion, multi_task_dit) still use deterministic
  mode for reproducible profiling results.

Made-with: Cursor
2026-04-16 15:34:17 +02:00
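The per-spec arrangement described in dbe01b0444 could look roughly like the sketch below. It is an illustration only: `ProfilingSpec`, `SPECS`, `build_cli_args`, and the flag spelling are hypothetical and not taken from the repository. Only the model names and the `cudnn_deterministic` / `use_amp` settings come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProfilingSpec:
    """Hypothetical per-model profiling spec (name and fields are assumptions)."""

    policy: str
    # Extra training arguments that apply to this model only.
    train_args: dict = field(default_factory=dict)


# Small models keep deterministic mode; the pi0 family enables AMP instead.
# The commit does not list any other per-model arguments, so none are shown.
SPECS = [
    ProfilingSpec("act", {"cudnn_deterministic": True}),
    ProfilingSpec("diffusion", {"cudnn_deterministic": True}),
    ProfilingSpec("multi_task_dit", {"cudnn_deterministic": True}),
    ProfilingSpec("pi0", {"use_amp": True}),
    ProfilingSpec("pi0_fast", {"use_amp": True}),
    ProfilingSpec("pi05", {"use_amp": True}),
]


def build_cli_args(spec: ProfilingSpec) -> list[str]:
    """Render a spec as command-line arguments (flag spelling is illustrative)."""
    args = [f"--policy.type={spec.policy}"]
    for key, value in sorted(spec.train_args.items()):
        # Booleans are rendered lowercase, matching the `use_amp=true` form above.
        args.append(f"--{key}={str(value).lower()}")
    return args


if __name__ == "__main__":
    for spec in SPECS:
        print(" ".join(build_cli_args(spec)))
```

Keeping the flag in each spec rather than in shared defaults means a model that breaks under deterministic mode can opt out without changing the settings used for the small models.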
Pepijn e16a95a78e refactor(profiling): remove cProfile, keep torch profiler only
Remove cProfile wrapping from the training loop and profiling utilities.
The torch profiler already captures fine-grained timing and operator
breakdowns; cProfile added redundant overhead without actionable
insight for GPU-bound models.

- Remove render_cprofile_summary, run_with_cprofile from profiling_utils
- Replace cProfile-wrapped calls in lerobot_train with direct calls
- Remove cprofile_summaries from artifact index in run_model_profiling
- Update tests to match

Made-with: Cursor
2026-04-16 15:34:17 +02:00
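For context on e16a95a78e, a minimal sketch of profiling a training step with the torch profiler alone, with no cProfile wrapper around the call. The toy model, `train_step`, and `profile_steps` are hypothetical stand-ins, not the code in `lerobot_train`; only the `torch.profiler` calls are real PyTorch API.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def train_step(model, batch, optimizer):
    """One toy optimization step, called directly (no cProfile wrapping)."""
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()


def profile_steps(num_steps: int = 3) -> str:
    """Run a few steps under the torch profiler and return an operator summary."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(16, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    batch = torch.randn(8, 16, device=device)

    # Always record CPU-side operators; add CUDA kernels when a GPU is present.
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(num_steps):
            train_step(model, batch, optimizer)

    # Per-operator breakdown, sorted by total CPU time.
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)


if __name__ == "__main__":
    print(profile_steps())
```

The summary table lists operators such as matrix multiplies with their timings, which is the operator-level breakdown the commit message refers to.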
Pepijn 6d1a5fca02 fix(profiling): keep ci green when hub publish is unauthorized 2026-04-16 13:07:30 +02:00
Pepijn 8d7099cd7d fix(profiling): publish preview runs via hf dataset prs 2026-04-16 12:50:57 +02:00
Pepijn 516f39685a fix(profiling): skip dataset creation on publish 2026-04-16 12:09:03 +02:00
Pepijn b27e838376 fix(profiling): publish preview rows to existing dataset 2026-04-16 11:54:35 +02:00
Pepijn 40470648d1 feat(profiling): publish preview runs for dashboard debugging 2026-04-16 10:54:34 +02:00
Pepijn 28e8483297 fix(ci): disable policy hub push in profiling runs 2026-04-15 23:02:28 +02:00
Pepijn 1a2aec1b04 feat(profiling): add weekly model profiling 2026-04-15 22:31:44 +02:00