# Multi-Task DiT Policy

Multi-Task Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture that leverages a large DiT with text and vision conditioning for multi-task robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.

## Model Overview

The model uses:

- **CLIP Vision Encoder**: Processes RGB images from multiple camera views
- **CLIP Text Encoder**: Encodes language task instructions (frozen weights with a learnable projection)
- **Diffusion Transformer**: Predicts action sequences conditioned on observations and language
- **Two Objectives**: Supports both diffusion (DDPM/DDIM) and flow matching for action generation

This model is exciting because it achieves extremely high dexterity, competitive with multi-billion-parameter VLAs, with only ~450M parameters and significantly less training.

## Installation Requirements

Multi-Task DiT Policy has additional dependencies. Install it with:

```bash
pip install lerobot[multi_task_dit]
```

This installs all necessary dependencies, including the Hugging Face Transformers library used for the CLIP models.

## Usage

To use Multi-Task DiT in your LeRobot configuration, specify the policy type as:

```bash
policy.type=multi_task_dit
```

## Training

### Basic Training Command

Here's a complete command for training Multi-Task DiT on your dataset:

```bash
lerobot-train \
  --dataset.repo_id={{MY_DATASET_ID}} \
  --output_dir={{MY_OUTPUT_DIR}} \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.repo_id={{MY_REPO_ID}} \
  --batch_size=32 \
  --steps=5000 \
  --save_freq=500 \
  --log_freq=100 \
  --wandb.enable=true
```

### Recommended Hyperparameters and Dataset Details (30Hz Control Frequency)

For reliable performance, start with these suggested default hyperparameters:

```bash
lerobot-train \
  --dataset.repo_id={{MY_DATASET_ID}} \
  --output_dir={{MY_OUTPUT_DIR}} \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --batch_size=320 \
  --steps=30000 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.repo_id={{MY_REPO_ID}} \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --wandb.enable=true
```

**Key Parameters:**

- **Batch Size**: 192-320 - if you have access to a GPU that can support this, you will get the best training dynamics
- **Horizon**: 32 - number of action steps to predict, ~1.0 sec at 30Hz
- **n_action_steps**: 24 - ~0.8 seconds at 30Hz
- **Objective**: `diffusion` - start with diffusion and experiment with flow matching if generation quality is poor
- **Training Steps**: >30k steps recommended for a single task

### Training Configuration Parameters

#### Objective Selection

Choose between diffusion and flow matching:

```bash
# Diffusion objective (default)
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \  # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \  # For faster inference

# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \  # or "uniform"; beta tends to perform noticeably better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \  # or "rk4"
```
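To make the two objectives concrete, the sketch below shows the core of flow-matching inference in isolation: the network is trained to predict a velocity field, and at inference time the sampler integrates that field from Gaussian noise at `t=0` to an action chunk at `t=1`, which is what `num_integration_steps` and `integration_method` control. The diffusion objective instead predicts noise and applies the DDPM/DDIM scheduler update at each of `num_inference_steps` steps. This is a minimal, self-contained illustration with a stand-in network, not the LeRobot implementation:

```python
import torch

torch.manual_seed(0)

HORIZON, ACTION_DIM = 32, 6  # illustrative shapes; the real DiT is conditioned on images and text


class DummyVelocityNet(torch.nn.Module):
    """Stand-in for the conditioned DiT: maps (noisy actions, timestep) -> predicted velocity."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(ACTION_DIM + 1, ACTION_DIM)

    def forward(self, x, t):
        t_feat = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, t_feat], dim=-1))


@torch.no_grad()
def sample_actions_flow_matching(model, num_integration_steps=100):
    """Euler integration of the learned velocity field from noise (t=0) to actions (t=1)."""
    x = torch.randn(HORIZON, ACTION_DIM)  # start from Gaussian noise
    dt = 1.0 / num_integration_steps
    for step in range(num_integration_steps):
        t = step * dt
        x = x + model(x, t) * dt  # one Euler step along the predicted velocity
    return x  # (HORIZON, ACTION_DIM) action chunk


actions = sample_actions_flow_matching(DummyVelocityNet())
print(actions.shape)  # torch.Size([32, 6])
```

Fewer integration steps trade sample quality for inference speed, analogous to lowering `num_inference_steps` with DDIM under the diffusion objective.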
#### Transformer Architecture

Adjust model capacity based on dataset size:

```bash
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512

# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512

# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512
```

#### Vision Encoder Configuration

```bash
# Use a larger CLIP model for more expressivity at the cost of inference time
--policy.vision_encoder_name=openai/clip-vit-large-patch14

# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \  # you may need to resize your images for inference speed-ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true  # Random crop during training, center crop at inference
```

#### Learning Rate Configuration

The vision encoder uses a separate learning rate multiplier; 1/10th of the main learning rate is the suggested starting point:

```bash
--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1  # Vision encoder LR = 0.1 * optimizer_lr
```

### Training Tuning Guidelines

#### 1. Flow Matching with Beta Sampling

Consider switching to flow matching with the beta timestep sampling distribution for potentially improved performance:

```bash
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999
```

This hasn't been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.

#### 2. Number of Transformer Layers

Match model capacity to your dataset size:

- **Small datasets** (< 100 examples): Reduce to 4 layers
- **Large datasets** (> 5k examples): Increase to 8 layers

#### 3. `horizon` Tuning

The model can be sensitive to the horizon you choose. Start with roughly a 1-second horizon based on your control frequency:

- **30 Hz control frequency**: `horizon=30`
- **10 Hz control frequency**: `horizon=10`

Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.

#### 4. `n_action_steps` Sensitivity

The model can also be very sensitive to `n_action_steps`. Start with roughly 0.8 seconds' worth of actions based on your control frequency and tune from there (see the sketch after this list):

- **Lower values**: More reactive, but potentially less stable for long-horizon tasks
- **Higher values**: Better for long-horizon execution, but the policy runs open-loop for longer, so it has less opportunity to recover from mistakes
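To see how `horizon` and `n_action_steps` interact at run time, here is a minimal receding-horizon sketch (the function names and shapes are placeholders, not the LeRobot API): the policy predicts a `horizon`-length chunk of actions, the robot executes only the first `n_action_steps` of them at the control frequency, and then a fresh chunk is predicted from the latest observation.

```python
import numpy as np

HORIZON = 32         # actions predicted per chunk (~1.0 s at 30 Hz)
N_ACTION_STEPS = 24  # actions executed before re-planning (~0.8 s at 30 Hz)
ACTION_DIM = 6       # illustrative action dimensionality


def predict_action_chunk(observation):
    """Placeholder for the policy: returns a (HORIZON, ACTION_DIM) action chunk."""
    return np.zeros((HORIZON, ACTION_DIM))


def get_observation():
    """Placeholder for reading camera images and proprioception."""
    return {}


def send_action(action):
    """Placeholder for commanding the robot at the control frequency."""
    pass


# Receding-horizon execution: a smaller N_ACTION_STEPS re-plans more often (more reactive),
# while a larger N_ACTION_STEPS runs open-loop for longer (smoother, but slower to recover).
for _ in range(10):  # 10 chunks of 24 steps ≈ 8 s of execution at 30 Hz
    chunk = predict_action_chunk(get_observation())
    for action in chunk[:N_ACTION_STEPS]:
        send_action(action)
```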
### Inference Tuning

For faster inference, use DDIM with fewer sampling steps:

```bash
--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
```

### Resuming Training

To resume training from a checkpoint:

```bash
lerobot-train \
  --config_path=$OUTPUT_DIR/checkpoints/00001000/pretrained_model/train_config.json \
  --resume=true \
  --output_dir=$OUTPUT_DIR
```

The checkpoint directory should contain `model.safetensors` and `config.json` files (saved automatically during training).

## Common Failure Modes and Debugging

Training these models can be finicky. Here are common failure modes and debugging approaches:

### Idling / No Motion

The model may "collapse" during inference, resulting in static behavior or no motion at all. This can occur when:

1. **Insufficient training data**: If you only have 20-50 examples, try to roughly double your dataset size. If you are still seeing this with more than 300 examples, the task may be too complex.
2. **Multiple similar tasks**: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning, which might not be rich enough to disambiguate them.

**Debugging tips:**

- Increase dataset size (keep doubling until you get to over 300 examples)
- Train for longer, up to 100k steps, even when the loss flatlines
- Check that the model is receiving the correct language instructions, or increase the diversity of instructions

### Executing the Wrong Task

Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.

**Potential causes:**

- Language instruction ambiguity
- Insufficient task-specific training data
- Model confusion between similar tasks in the multitask dataset

**Debugging tips:**

- Verify language instruction specificity, especially if descriptions are similar across multiple tasks
- Check the task distribution in your training dataset and add weighting to the failing/ignored task
- Consider task-specific fine-tuning

### Training Instability

If training loss is unstable or diverging:

- Try adjusting the learning rate between `1e-5` and `3e-4`
- Increase the batch size if possible
- Check that your dataset normalization is correct
- Verify that image preprocessing is working correctly

## Performance Considerations

### GPU Requirements

- **Inference**: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable inference speed
- **Training**: A GPU with enough VRAM to fit batch sizes above 64 is ideal; the exact requirement varies with the number of image observations per step

### Batch Size Recommendations

- **Minimum**: 64 (anything less may result in unstable training)
- **Recommended**: 256-320 (best performance, requires a larger GPU)

## Example: Training on Custom Dataset

Here's a complete example of training on a custom dataset:

```bash
lerobot-train \
  --dataset.repo_id={{MY_DATASET_ID}} \
  --output_dir={{MY_OUTPUT_DIR}} \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --batch_size=320 \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
  --eval_freq=1000 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[320,240] \
  --policy.image_crop_shape=[224,224] \
  --wandb.enable=true \
  --wandb.project=multitask_dit \
  --policy.repo_id={{MY_REPO_ID}}
```

## References

For more details on the technical implementation and architecture, see:

- [A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation](https://arxiv.org/abs/2507.05331)
- [Large Behavior Models and Atlas Find New Footing](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)