- Move cudnn_deterministic to per-spec train_args instead of hardcoding
it for all models. cuBLAS deterministic mode triggers internal errors
on Gemma-based models (pi0, pi05) during the backward pass.
- Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce the memory
footprint from fp32 (~16GB for weights alone) to bf16, fitting within
the 22GB GPU budget with room left for activations and gradients.
- Small models (act, diffusion, multi_task_dit) still use deterministic
mode for reproducible profiling results.
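The split above can be sketched as a per-spec mapping; `TRAIN_SPECS` and
the dict shape are illustrative stand-ins, not the actual spec schema:

```python
# Hypothetical per-spec train_args (names illustrative, not the real
# spec structure). Small models keep deterministic cuDNN for
# reproducible profiling; Gemma-based models disable it (cuBLAS
# deterministic mode errors in their backward pass) and rely on bf16
# autocast to fit the memory budget instead.
TRAIN_SPECS = {
    "act":            {"cudnn_deterministic": True,  "use_amp": False},
    "diffusion":      {"cudnn_deterministic": True,  "use_amp": False},
    "multi_task_dit": {"cudnn_deterministic": True,  "use_amp": False},
    "pi0":            {"cudnn_deterministic": False, "use_amp": True},
    "pi0_fast":       {"cudnn_deterministic": False, "use_amp": True},
    "pi05":           {"cudnn_deterministic": False, "use_amp": True},
}
```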
Made-with: Cursor
Remove cProfile wrapping from the training loop and profiling utilities.
The torch profiler already captures fine-grained timing and operator
breakdowns; cProfile added redundant overhead without actionable
insight for GPU-bound models.
- Remove render_cprofile_summary and run_with_cprofile from profiling_utils
- Replace cProfile-wrapped calls in lerobot_train with direct calls
- Remove cprofile_summaries from artifact index in run_model_profiling
- Update tests to match
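The change amounts to dropping the cProfile layer around each call;
`run_step` and `step_fn` here are hypothetical stand-ins for the actual
lerobot_train call sites, not the real API:

```python
# Illustrative only: run_step/step_fn are assumed names.
def run_step(step_fn, *args):
    # Before: each call went through an extra cProfile layer, e.g.
    #   prof = cProfile.Profile(); return prof.runcall(step_fn, *args)
    # After: direct call; the torch profiler context around the loop
    # already records operator-level CPU/CUDA timing and breakdowns.
    return step_fn(*args)
```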
Made-with: Cursor