mirror of https://github.com/huggingface/lerobot.git
synced 2026-05-18 18:20:08 +00:00
da7da741f1
Adam's optimizer states (exp_avg and exp_avg_sq) require ~16GB on top of the model parameters and gradients for 4B-parameter models, which exceeds the 22GB GPU. SGD without momentum keeps no optimizer state, and profiling only measures forward/backward timing anyway, so it does not need Adam.

Also add a torch.cuda.empty_cache() call after the deterministic forward pass to release transient memory before the training loop starts.

Made-with: Cursor
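The memory figure above can be sanity-checked with a rough back-of-the-envelope sketch. The helper `optimizer_state_bytes` is hypothetical and not part of the repository. The 2-bytes-per-element default (e.g. bf16 states) is an assumption chosen because it reproduces the ~16GB figure; with 4-byte fp32 states the same estimate doubles.

```python
def optimizer_state_bytes(num_params: int, optimizer: str, bytes_per_element: int = 2) -> int:
    """Estimate the extra memory held by optimizer state tensors (hypothetical helper)."""
    # Adam keeps two state tensors per parameter (exp_avg and exp_avg_sq);
    # SGD without momentum keeps none.
    states_per_param = {"adam": 2, "sgd": 0}
    return num_params * states_per_param[optimizer] * bytes_per_element


NUM_PARAMS = 4_000_000_000  # 4B-parameter model, as in the commit message

# Assumption: 2 bytes per state element (the default above).
adam_gb = optimizer_state_bytes(NUM_PARAMS, "adam") / 1e9
sgd_gb = optimizer_state_bytes(NUM_PARAMS, "sgd") / 1e9
print(f"Adam: {adam_gb:.0f} GB, SGD: {sgd_gb:.0f} GB")
```

This only counts optimizer state; parameters, gradients, and activations come on top of it, which is why the Adam case does not fit in 22GB.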