From 44fd3c0a0e1f3b242e6cdc7cd3151ac2d715a73c Mon Sep 17 00:00:00 2001 From: Maxime Ellerbach Date: Mon, 15 Jun 2026 14:15:09 +0000 Subject: [PATCH] adding docs for FSDP --- docs/source/multi_gpu_training.mdx | 46 ++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/docs/source/multi_gpu_training.mdx b/docs/source/multi_gpu_training.mdx index d7369e8f8..181390485 100644 --- a/docs/source/multi_gpu_training.mdx +++ b/docs/source/multi_gpu_training.mdx @@ -113,6 +113,52 @@ accelerate launch --num_processes=2 $(which lerobot-train) \ --policy=act ``` +## Training Large Models with FSDP + +DDP replicates the full model on every GPU, so a model that doesn't fit on one GPU won't fit under +DDP either. For large models, use **FSDP** (Fully Sharded Data Parallel), which shards parameters, +gradients, and optimizer state across GPUs. See the [accelerate FSDP guide](https://huggingface.co/docs/accelerate/usage_guides/fsdp) for background. + +An example on how to launch LeRobot training with FSDP across 4 GPUs (1 machine): + +```bash +accelerate launch --config_file fsdp.yaml --num_processes=4 $(which lerobot-train) \ + --dataset.repo_id=${HF_USER}/my_dataset \ + --policy.type= \ + --output_dir=outputs/train/my_policy_fsdp +``` + +A minimal `fsdp.yaml` (FSDP1; shards params/grads/optimizer — ZeRO-3-equivalent): + +```yaml +compute_environment: LOCAL_MACHINE +distributed_type: FSDP +mixed_precision: bf16 +num_machines: 1 +num_processes: 4 +fsdp_config: + fsdp_version: 1 + fsdp_sharding_strategy: FULL_SHARD # params + grads + optimizer (ZeRO-3) + fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP + fsdp_transformer_layer_cls_to_wrap: # repeated block class to shard + fsdp_use_orig_params: true # required: optimizer is built pre-prepare + fsdp_state_dict_type: FULL_STATE_DICT +``` + +Set `fsdp_transformer_layer_cls_to_wrap` to your model's repeated transformer-block class so each +block is sharded as its own unit. `fsdp_use_orig_params: true` is required because LeRobot builds the +optimizer before `accelerator.prepare()`. + +### FSDP checkpoints + +LeRobot gathers the full state dict across all ranks and the main process writes it as a single +`model.safetensors`, loadable as usual with `Policy.from_pretrained(...)`. Two thigs to look out for: + +- With mixed precision, (`bf16`/`fp16`) FSDP keeps an fp32 master copy, so the checkpoint is fp32 + (~2× the bf16 size on disk) and is cast back to the policy dtype on load. +- **Optimizer state is not saved under FSDP**, so **resume-from-checkpoint is not supported**. + Saved weights are fully usable for evaluation and fine-tuning. + ## Notes - The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration.