# Multi-Task DiT Policy

Multi-Task Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture that leverages a large DiT with text and vision conditioning for multi-task robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.

## Model Overview

The model uses:

- **CLIP Vision Encoder**: Processes RGB images from multiple camera views
- **CLIP Text Encoder**: Encodes language task instructions (frozen weights with a learnable projection)
- **Diffusion Transformer**: Predicts action sequences conditioned on observations and language
- **Two Objectives**: Supports both diffusion (DDPM/DDIM) and flow matching for action generation

This model is exciting because you can achieve extremely high dexterity, competitive with multi-billion-parameter VLAs, with only ~450M parameters and significantly less training.

## Installation Requirements

Multi-Task DiT Policy has additional dependencies. Install it with:

```bash
pip install lerobot[multi_task_dit]
```

This will install all necessary dependencies, including the HuggingFace Transformers library for the CLIP models.
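The two objectives reduce to different regression targets for the same transformer. The following is a minimal PyTorch sketch (illustrative only, not the LeRobot implementation; function names are made up, and the flow-matching path assumes the standard optimal-transport interpolation):

```python
# Illustrative sketch of the two training targets; NOT the LeRobot implementation.
import torch

def diffusion_target(actions, noise, alpha_bar_t):
    """DDPM with prediction_type="epsilon": the network receives the noised
    action chunk and regresses the noise that was added."""
    noised = alpha_bar_t.sqrt() * actions + (1 - alpha_bar_t).sqrt() * noise
    return noised, noise  # (network input, regression target)

def flow_matching_target(actions, noise, t, sigma_min=0.0):
    """Flow matching: interpolate from noise (t=0) toward the clean action
    chunk (t=1); the network regresses the constant velocity of that path."""
    x_t = (1 - (1 - sigma_min) * t) * noise + t * actions
    velocity = actions - (1 - sigma_min) * noise
    return x_t, velocity  # (network input, regression target)

actions = torch.randn(2, 32, 7)  # (batch, horizon, action_dim)
noise = torch.randn_like(actions)
x_t, target = flow_matching_target(actions, noise, t=torch.tensor(0.5))
```

A quick sanity check for either variant: at `t=1` the flow interpolant recovers the clean actions, and at `alpha_bar_t=1` the diffusion input is noise-free.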
## Usage

To use Multi-Task DiT in your LeRobot configuration, specify the policy type as:

```python
policy.type=multi_task_dit
```

## Training

### Basic Training Command

Here's a complete training command for training Multi-Task DiT on your dataset:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=32 \
  --steps=5000 \
  --save_freq=500 \
  --log_freq=100 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true
```

### Recommended Hyperparameters and Dataset Details (30 Hz Control Frequency)

For reliable performance, start with these suggested default hyperparameters:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true
```

**Key Parameters:**

- **Batch Size**: 192-320 - if you have access to a GPU that can support this, you will get the best training dynamics
- **Horizon**: 32 - number of action steps to predict, ~1.0 sec at 30 Hz
- **n_action_steps**: 24 - ~0.8 seconds at 30 Hz
- **Objective**: `diffusion` - start with diffusion and experiment with flow matching if generation quality is poor
- **Training Steps**: >30k steps recommended for a single task

### Training Configuration Parameters

#### Objective Selection

Choose between diffusion and flow matching:

```bash
# Diffusion objective (default)
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \        # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \           # For faster inference
--policy.beta_schedule=squaredcos_cap_v2 \  # Noise schedule type
--policy.prediction_type=epsilon \          # "epsilon" (predict noise) or "sample" (predict clean)
--policy.clip_sample=true \                 # Clip samples during denoising
--policy.clip_sample_range=1.0              # Clipping range [-x, x]

# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \  # or "uniform"; beta sampling appears to perform much better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \         # or "rk4"
--policy.sigma_min=0.0                      # Minimum noise in the flow interpolation path
```

#### Transformer Architecture

Adjust model capacity based on dataset size:

```bash
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64
```

**Positional Encoding Options:**

The model supports two positional encoding methods for action sequences:

```bash
# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0  # Base frequency for RoPE

# Absolute positional encoding
--policy.use_positional_encoding=true  # Disables RoPE when true
```

**Other Transformer Parameters:**

```bash
--policy.dropout=0.1             # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256  # Timestep embedding dimension
```

#### Vision Encoder Configuration

```bash
# Use a different CLIP model for more expressivity at the cost of inference time.
# Experiment with larger or smaller models depending on the complexity of your
# tasks and the size of your dataset.
--policy.vision_encoder_name=openai/clip-vit-large-patch14

# Use a separate vision encoder per camera. This may be useful when cameras have
# significantly different characteristics, but be wary of the increased VRAM
# footprint.
--policy.use_separate_rgb_encoder_per_camera=true

# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \  # you may need to resize your images for inference speed-ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true       # Random during training, center at inference
```

#### Text Encoder Configuration

```bash
# Use a different CLIP text encoder model. Same as vision: experiment with
# larger or smaller models depending on the complexity of your tasks and the
# size of your dataset.
--policy.text_encoder_name=openai/clip-vit-large-patch14
```

#### Learning Rate Configuration

The vision encoder uses a separate learning rate multiplier, where 1/10th of the base learning rate is suggested as the ideal starting point:

```bash
--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1  # Vision encoder LR = 0.1 * optimizer_lr
```

### Training Tuning Guidelines

#### 1. Flow Matching with Beta Sampling

The original diffusion implementation here is based on the work described in [TRI's LBM paper](https://arxiv.org/abs/2507.05331). Additionally, we have implemented a flow-matching objective, which is described at a high level in [Boston Dynamics' blog post](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/). Consider testing the flow-matching objective and evaluating performance differences for your task:

```bash
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999
```

This hasn't been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.

#### 2. Number of Transformer Layers

Match model capacity to your dataset size:

- **Small datasets** (< 100 examples): Reduce to 4 layers
- **Large datasets** (> 5k examples): Increase to 8 layers

#### 3. `horizon` Tuning

The model can be sensitive to the horizon you choose.
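Chunk lengths translate to wall-clock time via the control frequency. A quick check of the recommended 30 Hz defaults (`horizon=32`, `n_action_steps=24`) in plain Python, purely illustrative and not part of the LeRobot CLI:

```python
# Quick sanity check (plain Python, not part of the LeRobot CLI):
# a chunk of num_steps actions spans num_steps / control_hz seconds.
def chunk_seconds(num_steps: int, control_hz: float) -> float:
    return num_steps / control_hz

print(f"horizon=32        -> {chunk_seconds(32, 30):.2f} s predicted")
print(f"n_action_steps=24 -> {chunk_seconds(24, 30):.2f} s executed per chunk")
```

So at 30 Hz the model predicts roughly a second ahead and executes 0.8 s of it before replanning.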
Start with around a 1-second horizon based on your control frequency:

- **30 Hz frequency**: `horizon=30`
- **10 Hz frequency**: `horizon=10`

Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.

#### 4. `n_action_steps` Sensitivity

The model can also be very sensitive to `n_action_steps`. Start with around 0.8 seconds based on your control frequency and tune from there:

- **Lower values**: More reactive but potentially less stable for long-horizon tasks
- **Higher values**: Better for long-horizon execution, but open-loop execution limits recovery from failures

### Inference Tuning

For faster inference, use DDIM with fewer sampling steps:

```bash
--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
```

### Resuming Training

To resume training from a checkpoint:

```bash
lerobot-train \
  --config_path=./outputs/multitask_dit_training/checkpoints/last/pretrained_model/train_config.json \
  --resume=true
```

The checkpoint directory should contain `model.safetensors` and `config.json` files (saved automatically during training). When resuming, the configuration is loaded from the checkpoint, so you don't need to specify other parameters.

## Common Failure Modes and Debugging

Training these models can be finicky. Here are common failure modes and debugging approaches:

### Idling / No Motion

The model may "collapse" during inference, resulting in static or no motion. This can occur when:

1. **Insufficient training data**: If you only have 20-50 examples, try to roughly double your dataset size. Once you have more than 300 examples, if you're still seeing this, the task may be too complex.
2. **Multiple similar tasks**: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning, which might not be rich enough.
**Debugging tips:**

- Increase dataset size (double it until you get over 300 examples)
- Train for longer, up to 100k steps, even when the loss flatlines
- Check that the model is receiving proper language instructions, or increase the diversity of instructions

### Executing the Wrong Task

Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.

**Potential causes:**

- Language instruction ambiguity
- Insufficient task-specific training data
- Model confusion between similar tasks in the multitask dataset

**Debugging tips:**

- Verify language instruction specificity, especially if descriptions are similar between multiple tasks
- Check the task distribution in your training dataset and add weighting to the failing/ignored task
- Consider task-specific fine-tuning

### Training Instability

If training loss is unstable or diverging:

- Try adjusting the learning rate between `1e-5` and `3e-4`
- Increase the batch size if possible
- Check that your dataset normalization is correct
- Verify that image preprocessing is working correctly

## Performance Considerations

### GPU Requirements

- **Inference**: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable inference speed
- **Training**: A GPU with enough VRAM to load batch sizes of >64 is ideal; requirements will vary depending on the number of image observations, etc.

### Batch Size Recommendations

- **Minimum**: 64 (less than this may result in unstable training)
- **Recommended**: 256-320 (best performance, requires a larger GPU)

## Example: Training on a Custom Dataset

Here's a complete example of training on a custom dataset:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
  --eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[320,240] \
  --policy.image_crop_shape=[224,224] \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true \
  --wandb.project=multitask_dit
```

## References

For more details on the technical implementation and architecture, see:

- [A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation](https://arxiv.org/abs/2507.05331)
- [Large Behavior Models and Atlas Find New Footing](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)