feat(train): add accelerate for multi gpu training (#2154)

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-21 19:49:49 +00:00

* Enhance training and logging functionality with accelerator support

- Added support for multi-GPU training by introducing an `accelerator` parameter in training functions.
- Updated `update_policy` to handle gradient updates based on the presence of an accelerator.
- Modified logging to prevent duplicate messages in non-main processes.
- Enhanced `set_seed` and `get_safe_torch_device` functions to accommodate accelerator usage.
- Updated `MetricsTracker` to account for the number of processes when calculating metrics.
- Introduced a new feature in `pyproject.toml` for the `accelerate` library dependency.
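
The process-count adjustment to `MetricsTracker` described above can be pictured with a small sketch. This is not the actual `MetricsTracker` code: `SampleCounter` and its fields are hypothetical names, and the sketch assumes each process consumes one batch of `batch_size` samples per optimizer step.

```python
from dataclasses import dataclass


@dataclass
class SampleCounter:
    """Hypothetical helper for illustration; not the lerobot MetricsTracker API."""

    batch_size: int         # per-process batch size (assumption)
    num_frames: int         # total frames in the dataset
    num_processes: int = 1  # number of processes (GPUs) training in parallel
    steps: int = 0
    samples: int = 0

    def step(self) -> None:
        # One optimizer step consumes one batch on every process,
        # so the global sample count grows by batch_size * num_processes.
        self.steps += 1
        self.samples += self.batch_size * self.num_processes

    @property
    def epochs(self) -> float:
        # Fraction of the dataset seen so far, across all processes.
        return self.samples / self.num_frames


counter = SampleCounter(batch_size=8, num_frames=1000, num_processes=4)
for _ in range(10):
    counter.step()
print(counter.steps, counter.samples, counter.epochs)
```

With a single process the counter reduces to the usual `steps * batch_size`; the extra factor only matters once several processes share the work.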

* Initialize logging in training script for both main and non-main processes

- Added `init_logging` calls to ensure proper logging setup when using the accelerator and in standard training mode.
- This change enhances the clarity and consistency of logging during training sessions.
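
A minimal sketch of main-process-only logging, using only the standard library. `init_logging_sketch` is a hypothetical stand-in, not the repository's `init_logging`. In a real run the `is_main_process` flag would typically come from Accelerate's `Accelerator.is_main_process`; here it is passed in by hand so the example stays self-contained.

```python
import logging


def init_logging_sketch(is_main_process: bool, name: str = "train_sketch") -> logging.Logger:
    """Return a logger that emits INFO on the main process and only errors elsewhere."""
    suffix = "main" if is_main_process else "worker"
    logger = logging.getLogger(f"{name}.{suffix}")
    logger.handlers.clear()
    logger.propagate = False  # keep the sketch independent of the root logger
    if is_main_process:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    else:
        # Non-main processes stay quiet unless something goes wrong.
        logger.addHandler(logging.NullHandler())
        logger.setLevel(logging.ERROR)
    return logger


main_logger = init_logging_sketch(is_main_process=True)
worker_logger = init_logging_sketch(is_main_process=False)
main_logger.info("Logs will be synced with wandb.")    # emitted once
worker_logger.info("Logs will be synced with wandb.")  # suppressed
```

Raising the level on non-main processes, rather than skipping setup entirely, keeps error messages from any rank visible while avoiding duplicated informational lines.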

* add docs and only push model once

* Place logging under accelerate and update docs

* fix pre commit

* only log in main process

* main logging

* try with local rank

* add tests

* change runner

* fix test

* don't push to hub in multi-GPU tests

* pre-download dataset in tests

* small fixes

* fix path optimizer state

* update docs, and small improvements in train

* simplify accelerate main process detection

* small improvements in train

* fix OOM bug

* change accelerate detection

* add some debugging

* always use accelerate

* cleanup update method

* cleanup

* fix bug

* scale lr decay if we reduce steps

* cleanup logging

* fix formatting

* incorporate PR feedback

* add min memory to cpu tests

* use accelerate to determine logging

* fix precommit and fix tests

* chore: minor details

---------

Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: Steven Palma <steven.palma@huggingface.co>

Pepijn, 2025-10-16 17:41:55 +02:00 (committed by GitHub)

parent 845b359d39
commit e82e7a02e9

13 changed files with 625 additions and 134 deletions

									
src/lerobot/rl/wandb_utils.py (+1, -1)
												
@@ -99,7 +99,7 @@ class WandBLogger:
         cfg.wandb.run_id = run_id
         # Handle custom step key for rl asynchronous training.
         self._wandb_custom_step_key: set[str] | None = None
-        print(colored("Logs will be synced with wandb.", "blue", attrs=["bold"]))
+        logging.info(colored("Logs will be synced with wandb.", "blue", attrs=["bold"]))
         logging.info(f"Track this run --> {colored(wandb.run.get_url(), 'yellow', attrs=['bold'])}")
         self._wandb = wandb