diff --git a/docs/source/multitask_dit.mdx b/docs/source/multitask_dit.mdx
index e69de29bb..960b1c740 100644
--- a/docs/source/multitask_dit.mdx
+++ b/docs/source/multitask_dit.mdx
@@ -0,0 +1,295 @@
+# Multi-Task DiT Policy
+
+Multi-Task Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture, which leverages a large DiT with text and vision conditioning for multi-task robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.
+
+## Model Overview
+
+The model uses:
+
+- **CLIP Vision Encoder**: Processes RGB images from multiple camera views
+- **CLIP Text Encoder**: Encodes language task instructions (frozen weights with learnable projection)
+- **Diffusion Transformer**: Predicts action sequences conditioned on observations and language
+- **Two Objectives**: Supports both diffusion (DDPM/DDIM) and flow matching for action generation
+
+This model is exciting because you can achieve extremely high dexterity, competitive with multi-billion parameter
+VLAs, with only ~450M parameters and significantly less training.
+
+## Installation Requirements
+
+Multi-Task DiT Policy has additional dependencies. Install it with:
+
+```bash
+pip install lerobot[multi_task_dit]
+```
+
+This will install all necessary dependencies including the HuggingFace Transformers library for CLIP models.
+
+## Usage
+
+To use Multi-Task DiT in your LeRobot configuration, specify the policy type as:
+
+```python
+policy.type=multi_task_dit
+```
+
+## Training
+
+### Basic Training Command
+
+Here's a complete training command for training Multi-Task DiT on your dataset:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=$DATASET_ID \
+  --output_dir=$OUTPUT_DIR \
+  --job_name=$JOB_NAME \
+  --policy.type=multi_task_dit \
+  --policy.device=cuda \
+  --batch_size=32 \
+  --steps=5000 \
+  --save_freq=500 \
+  --log_freq=100 \
+  --wandb.enable=true \
+  --policy.repo_id=$REPO_ID
+```
+
+### Recommended Hyperparameters and Dataset Details (30Hz Control Frequency)
+
+For reliable performance, start with these suggested default hyperparameters:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=$DATASET_ID \
+  --output_dir=$OUTPUT_DIR \
+  --job_name=$JOB_NAME \
+  --policy.type=multi_task_dit \
+  --policy.device=cuda \
+  --batch_size=320 \
+  --steps=30000 \
+  --policy.horizon=32 \
+  --policy.n_action_steps=24 \
+  --policy.objective=diffusion \
+  --policy.noise_scheduler_type=DDPM \
+  --policy.num_train_timesteps=100 \
+  --wandb.enable=true
+```
+
+**Key Parameters:**
+
+- **Batch Size**: 192-320 - If you have access to a GPU that can support this, you will get the best training dynamics
+- **Horizon**: 32 - number of action steps to predict, ~1.0 sec at 30Hz
+- **n_action_steps**: 24 - ~0.8 seconds at 30Hz
+- **Objective**: `diffusion` - start with diffusion and experiment with flow matching if generation quality is poor
+- **Training Steps**: >30k steps recommended for a single task
+
+### Training Configuration Parameters
+
+#### Objective Selection
+
+Choose between diffusion and flow matching:
+
+```bash
+# Diffusion objective (default)
+--policy.objective=diffusion \
+--policy.noise_scheduler_type=DDPM \ # or "DDIM"
+--policy.num_train_timesteps=100 \
+--policy.num_inference_steps=10 \ # For faster inference
+
+# Flow matching objective
+--policy.objective=flow_matching \
+--policy.timestep_sampling_strategy=beta \ # or "uniform"; beta sampling appears to perform much better in practice
+--policy.num_integration_steps=100 \
+--policy.integration_method=euler \ # or "rk4"
+```
+
+#### Transformer Architecture
+
+Adjust model capacity based on dataset size:
+
+```bash
+# Small datasets (< 100 examples)
+--policy.num_layers=4 \
+--policy.hidden_dim=512
+
+# Medium datasets (100-5k examples) - default
+--policy.num_layers=6 \
+--policy.hidden_dim=512
+
+# Large datasets (> 5k examples)
+--policy.num_layers=8 \
+--policy.hidden_dim=512
+```
+
+#### Vision Encoder Configuration
+
+```bash
+# Use different CLIP model for more expressivity at the cost of inference time
+--policy.vision_encoder_name=openai/clip-vit-large-patch14
+
+# Image preprocessing
+--policy.image_resize_shape=[XXX,YYY] \ # you may need to resize your images for inference speedups
+--policy.image_crop_shape=[224,224] \
+--policy.image_crop_is_random=true # Random during training, center at inference
+```
+
+#### Learning Rate Configuration
+
+The vision encoder uses a separate learning rate multiplier; 0.1 (one tenth of the base learning rate) is the suggested starting point:
+
+```bash
+--policy.optimizer_lr=2e-5 \
+--policy.vision_encoder_lr_multiplier=0.1 # Vision encoder LR = 0.1 * optimizer_lr
+```
+
+### Training Tuning Guidelines
+
+#### 1. Flow Matching with Beta Sampling
+
+Consider switching to flow matching with a beta sampling distribution for potentially improved performance:
+
+```bash
+--policy.objective=flow_matching \
+--policy.timestep_sampling_strategy=beta \
+--policy.timestep_sampling_alpha=1.5 \
+--policy.timestep_sampling_beta=1.0 \
+--policy.timestep_sampling_s=0.999
+```
+
+This hasn't been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.
+
+#### 2. Number of Transformer Layers
+
+Match model capacity to your dataset size:
+
+- **Small datasets** (< 100 examples): Reduce to 4 layers
+- **Large datasets** (> 5k examples): Increase to 8 layers
+
+#### 3. `horizon` Tuning
+
+The model can be sensitive to the horizon you choose. Start with around a 1 second horizon based on your control frequency:
+
+- **30 Hz frequency**: `horizon=30`
+- **10 Hz frequency**: `horizon=10`
+
+Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.
+
+#### 4. `n_action_steps` Sensitivity
+
+The model can also be very sensitive to `n_action_steps`. Start with it being around 0.8 seconds based on your control frequency and tune from there:
+
+- **Lower values**: More reactive but potentially less stable for long-horizon tasks
+- **Higher values**: Better for long-horizon execution, but recovery from open-loop failures is limited
+
+### Inference Tuning
+
+For faster inference, use DDIM with fewer sampling steps:
+
+```bash
+--policy.noise_scheduler_type=DDIM \
+--policy.num_inference_steps=10
+```
+
+### Resuming Training
+
+To resume training from a checkpoint:
+
+```bash
+lerobot-train \
+  --config_path=$OUTPUT_DIR/checkpoints/00001000/pretrained_model/train_config.json \
+  --resume=true \
+  --output_dir=$OUTPUT_DIR
+```
+
+The checkpoint directory should contain `model.safetensors` and `config.json` files (saved automatically during training).
+
+## Common Failure Modes and Debugging
+
+Training these models can be finicky. Here are common failure modes and debugging approaches:
+
+### Idling / No Motion
+
+The model may "collapse" during inference, resulting in static or no motion. This can occur when:
+
+1. **Insufficient training data**: If you only have 20-50 examples, try to roughly double your dataset size. Once you have above 300 examples, if you're still seeing this, the task may be too complex.
+
+2. **Multiple similar tasks**: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning which might not be rich enough.
+
+**Debugging tips:**
+
+- Increase dataset size (double until you get to over 300 examples)
+- Train for longer, up to 100k steps, even when the loss flatlines
+- Check if the model is receiving proper language instructions or increase the diversity of instructions
+
+### Executing the Wrong Task
+
+Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.
+
+**Potential causes:**
+
+- Language instruction ambiguity
+- Insufficient task-specific training data
+- Model confusion between similar tasks in the multitask dataset
+
+**Debugging tips:**
+
+- Verify language instruction specificity, especially if descriptions are similar between multiple tasks
+- Check task distribution in your training dataset and add weighting to the failing/ignored task
+- Consider task-specific fine-tuning
+
+### Training Instability
+
+If training loss is unstable or diverging:
+
+- Try adjusting learning rate between `1e-5` and `3e-4`
+- Increase batch size if possible
+- Check that your dataset normalization is correct
+- Verify image preprocessing is working correctly
+
+## Performance Considerations
+
+### GPU Requirements
+
+- **Inference**: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable inference speed
+- **Training**: A GPU with enough VRAM to load batch sizes of >64 is ideal, which will vary depending on the number of image observations, etc.
+
+### Batch Size Recommendations
+
+- **Minimum**: 64 (less than this may result in unstable training)
+- **Recommended**: 256-320 (best performance, requires larger GPU)
+
+## Example: Training on Custom Dataset
+
+Here's a complete example training on a custom dataset:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=your_username/your_dataset \
+  --output_dir=outputs/multitask_dit_training \
+  --policy.type=multi_task_dit \
+  --policy.device=cuda \
+  --batch_size=320 \
+  --steps=30000 \
+  --save_freq=1000 \
+  --log_freq=100 \
+  --eval_freq=1000 \
+  --policy.horizon=32 \
+  --policy.n_action_steps=24 \
+  --policy.objective=diffusion \
+  --policy.noise_scheduler_type=DDPM \
+  --policy.num_layers=6 \
+  --policy.hidden_dim=512 \
+  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
+  --policy.image_resize_shape=[320,240] \
+  --policy.image_crop_shape=[224,224] \
+  --wandb.enable=true \
+  --wandb.project=multitask_dit \
+  --policy.repo_id=your_username/multitask_dit_policy
+```
+
+## References
+
+For more details on the technical implementation and architecture, see:
+
+- [A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation](https://arxiv.org/abs/2507.05331)
+- [Large Behavior Models and Atlas Find New Footing](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
+- [Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)