From 44fd3c0a0e1f3b242e6cdc7cd3151ac2d715a73c Mon Sep 17 00:00:00 2001
From: Maxime Ellerbach <maxime.ellerbach@huggingface.co>
Date: Mon, 15 Jun 2026 14:15:09 +0000
Subject: [PATCH] adding docs for FSDP

---
 docs/source/multi_gpu_training.mdx | 46 ++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/docs/source/multi_gpu_training.mdx b/docs/source/multi_gpu_training.mdx
index d7369e8f8..181390485 100644
--- a/docs/source/multi_gpu_training.mdx
+++ b/docs/source/multi_gpu_training.mdx
@@ -113,6 +113,52 @@ accelerate launch --num_processes=2 $(which lerobot-train) \
   --policy=act
 ```
 
+## Training Large Models with FSDP
+
+DDP replicates the full model on every GPU, so a model that doesn't fit on one GPU won't fit under
+DDP either. For large models, use **FSDP** (Fully Sharded Data Parallel), which shards parameters,
+gradients, and optimizer state across GPUs. See the [accelerate FSDP guide](https://huggingface.co/docs/accelerate/usage_guides/fsdp) for background.
+
+To launch LeRobot training with FSDP across 4 GPUs on a single machine:
+
+```bash
+accelerate launch --config_file fsdp.yaml --num_processes=4 $(which lerobot-train) \
+  --dataset.repo_id=${HF_USER}/my_dataset \
+  --policy.type=<your_policy> \
+  --output_dir=outputs/train/my_policy_fsdp
+```
+
+A minimal `fsdp.yaml` (FSDP1; shards params, grads, and optimizer state, equivalent to ZeRO-3):
+
+```yaml
+compute_environment: LOCAL_MACHINE
+distributed_type: FSDP
+mixed_precision: bf16
+num_machines: 1
+num_processes: 4
+fsdp_config:
+  fsdp_version: 1
+  fsdp_sharding_strategy: FULL_SHARD                          # params + grads + optimizer (ZeRO-3)
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: <YourTransformerBlock>  # repeated block class to shard
+  fsdp_use_orig_params: true                                  # required: optimizer is built pre-prepare
+  fsdp_state_dict_type: FULL_STATE_DICT
+```
+
+Set `fsdp_transformer_layer_cls_to_wrap` to your model's repeated transformer-block class so each
+block is sharded as its own unit. `fsdp_use_orig_params: true` is required because LeRobot builds the
+optimizer before `accelerator.prepare()`.
+
+### FSDP checkpoints
+
+LeRobot gathers the full state dict across all ranks and the main process writes it as a single
+`model.safetensors`, loadable as usual with `Policy.from_pretrained(...)`. Two things to look out for:
+
+- With mixed precision (`bf16`/`fp16`), FSDP keeps an fp32 master copy, so the checkpoint is fp32
+  (~2× the bf16 size on disk) and is cast back to the policy dtype on load.
+- **Optimizer state is not saved under FSDP**, so **resume-from-checkpoint is not supported**.
+  Saved weights are fully usable for evaluation and fine-tuning.
+
 ## Notes
 
 - The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration.