# Multi-Task DiT Policy

Multi-Task Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture that leverages a large DiT with text and vision conditioning for multi-task robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.

## Model Overview

The model uses:

- **CLIP Vision Encoder**: Processes RGB images from multiple camera views
- **CLIP Text Encoder**: Encodes language task instructions (frozen weights with a learnable projection)
- **Diffusion Transformer**: Predicts action sequences conditioned on observations and language
- **Two Objectives**: Supports both diffusion (DDPM/DDIM) and flow matching for action generation

This model is exciting because you can achieve extremely high dexterity, competitive with multi-billion-parameter VLAs, with only ~450M parameters and significantly less training.

## Installation Requirements

Multi-Task DiT Policy has additional dependencies. Install it with:

```bash
pip install lerobot[multi_task_dit]
```

This will install all necessary dependencies, including the HuggingFace Transformers library for the CLIP models.
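The two objectives reduce to different regression targets for the same transformer. The following is a minimal PyTorch sketch (illustrative only, not the LeRobot implementation; function names are made up, and the flow-matching path assumes the standard optimal-transport interpolation):

```python
# Illustrative sketch of the two training targets; NOT the LeRobot implementation.
import torch

def diffusion_target(actions, noise, alpha_bar_t):
    """DDPM with prediction_type="epsilon": the network receives the noised
    action chunk and regresses the noise that was added."""
    noised = alpha_bar_t.sqrt() * actions + (1 - alpha_bar_t).sqrt() * noise
    return noised, noise  # (network input, regression target)

def flow_matching_target(actions, noise, t, sigma_min=0.0):
    """Flow matching: interpolate from noise (t=0) toward the clean action
    chunk (t=1); the network regresses the constant velocity of that path."""
    x_t = (1 - (1 - sigma_min) * t) * noise + t * actions
    velocity = actions - (1 - sigma_min) * noise
    return x_t, velocity  # (network input, regression target)

actions = torch.randn(2, 32, 7)  # (batch, horizon, action_dim)
noise = torch.randn_like(actions)
x_t, target = flow_matching_target(actions, noise, t=torch.tensor(0.5))
```

A quick sanity check for either variant: at `t=1` the flow interpolant recovers the clean actions, and at `alpha_bar_t=1` the diffusion input is noise-free.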
## Usage

To use Multi-Task DiT in your LeRobot configuration, specify the policy type as:

```python
policy.type=multi_task_dit
```

## Training

### Basic Training Command

Here's a complete training command for training Multi-Task DiT on your dataset:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=32 \
  --steps=5000 \
  --save_freq=500 \
  --log_freq=100 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true
```

### Recommended Hyperparameters and Dataset Details (30 Hz Control Frequency)

For reliable performance, start with these suggested default hyperparameters:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true
```

**Key Parameters:**

- **Batch Size**: 192-320 - if you have access to a GPU that can support this, you will get the best training dynamics
- **Horizon**: 32 - number of action steps to predict, ~1.0 sec at 30 Hz
- **n_action_steps**: 24 - ~0.8 seconds at 30 Hz
- **Objective**: `diffusion` - start with diffusion and experiment with flow matching if generation quality is poor
- **Training Steps**: >30k steps recommended for a single task

### Training Configuration Parameters

#### Objective Selection

Choose between diffusion and flow matching:

```bash
# Diffusion objective (default)
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \        # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \           # For faster inference
--policy.beta_schedule=squaredcos_cap_v2 \  # Noise schedule type
--policy.prediction_type=epsilon \          # "epsilon" (predict noise) or "sample" (predict clean)
--policy.clip_sample=true \                 # Clip samples during denoising
--policy.clip_sample_range=1.0              # Clipping range [-x, x]

# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \  # or "uniform"; beta sampling appears to perform much better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \         # or "rk4"
--policy.sigma_min=0.0                      # Minimum noise in the flow interpolation path
```

#### Transformer Architecture

Adjust model capacity based on dataset size:

```bash
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64
```

**Positional Encoding Options:**

The model supports two positional encoding methods for action sequences:

```bash
# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0  # Base frequency for RoPE

# Absolute positional encoding
--policy.use_positional_encoding=true  # Disables RoPE when true
```

**Other Transformer Parameters:**

```bash
--policy.dropout=0.1             # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256  # Timestep embedding dimension
```

#### Vision Encoder Configuration

```bash
# Use a different CLIP model for more expressivity at the cost of inference time.
# Experiment with larger or smaller models depending on the complexity of your
# tasks and the size of your dataset.
--policy.vision_encoder_name=openai/clip-vit-large-patch14

# Use a separate vision encoder per camera. This may be useful when cameras have
# significantly different characteristics, but be wary of the increased VRAM
# footprint.
--policy.use_separate_rgb_encoder_per_camera=true

# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \  # you may need to resize your images for inference speed-ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true       # Random during training, center at inference
```

#### Text Encoder Configuration

```bash
# Use a different CLIP text encoder model. Same as vision: experiment with
# larger or smaller models depending on the complexity of your tasks and the
# size of your dataset.
--policy.text_encoder_name=openai/clip-vit-large-patch14
```

#### Learning Rate Configuration

The vision encoder uses a separate learning rate multiplier, where 1/10th of the base learning rate is suggested as the ideal starting point:

```bash
--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1  # Vision encoder LR = 0.1 * optimizer_lr
```

### Training Tuning Guidelines

#### 1. Flow Matching with Beta Sampling

The original diffusion implementation here is based on the work described in [TRI's LBM paper](https://arxiv.org/abs/2507.05331). Additionally, we have implemented a flow-matching objective, which is described at a high level in [Boston Dynamics' blog post](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/). Consider testing the flow-matching objective and evaluating performance differences for your task:

```bash
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999
```

This hasn't been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.

#### 2. Number of Transformer Layers

Match model capacity to your dataset size:

- **Small datasets** (< 100 examples): Reduce to 4 layers
- **Large datasets** (> 5k examples): Increase to 8 layers

#### 3. `horizon` Tuning

The model can be sensitive to the horizon you choose.
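Chunk lengths translate to wall-clock time via the control frequency. A quick check of the recommended 30 Hz defaults (`horizon=32`, `n_action_steps=24`) in plain Python, purely illustrative and not part of the LeRobot CLI:

```python
# Quick sanity check (plain Python, not part of the LeRobot CLI):
# a chunk of num_steps actions spans num_steps / control_hz seconds.
def chunk_seconds(num_steps: int, control_hz: float) -> float:
    return num_steps / control_hz

print(f"horizon=32        -> {chunk_seconds(32, 30):.2f} s predicted")
print(f"n_action_steps=24 -> {chunk_seconds(24, 30):.2f} s executed per chunk")
```

So at 30 Hz the model predicts roughly a second ahead and executes 0.8 s of it before replanning.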
Start with around a 1-second horizon based on your control frequency:

- **30 Hz frequency**: `horizon=30`
- **10 Hz frequency**: `horizon=10`

Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.

#### 4. `n_action_steps` Sensitivity

The model can also be very sensitive to `n_action_steps`. Start with around 0.8 seconds based on your control frequency and tune from there:

- **Lower values**: More reactive but potentially less stable for long-horizon tasks
- **Higher values**: Better for long-horizon execution, but open-loop execution limits recovery from failures

### Inference Tuning

For faster inference, use DDIM with fewer sampling steps:

```bash
--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
```

### Resuming Training

To resume training from a checkpoint:

```bash
lerobot-train \
  --config_path=./outputs/multitask_dit_training/checkpoints/last/pretrained_model/train_config.json \
  --resume=true
```

The checkpoint directory should contain `model.safetensors` and `config.json` files (saved automatically during training). When resuming, the configuration is loaded from the checkpoint, so you don't need to specify other parameters.

## Common Failure Modes and Debugging

Training these models can be finicky. Here are common failure modes and debugging approaches:

### Idling / No Motion

The model may "collapse" during inference, resulting in static or no motion. This can occur when:

1. **Insufficient training data**: If you only have 20-50 examples, try to roughly double your dataset size. Once you have more than 300 examples, if you're still seeing this, the task may be too complex.
2. **Multiple similar tasks**: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning, which might not be rich enough.
**Debugging tips:**

- Increase dataset size (double it until you get over 300 examples)
- Train for longer, up to 100k steps, even when the loss flatlines
- Check that the model is receiving proper language instructions, or increase the diversity of instructions

### Executing the Wrong Task

Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.

**Potential causes:**

- Language instruction ambiguity
- Insufficient task-specific training data
- Model confusion between similar tasks in the multitask dataset

**Debugging tips:**

- Verify language instruction specificity, especially if descriptions are similar between multiple tasks
- Check the task distribution in your training dataset and add weighting to the failing/ignored task
- Consider task-specific fine-tuning

### Training Instability

If training loss is unstable or diverging:

- Try adjusting the learning rate between `1e-5` and `3e-4`
- Increase the batch size if possible
- Check that your dataset normalization is correct
- Verify that image preprocessing is working correctly

## Performance Considerations

### GPU Requirements

- **Inference**: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable inference speed
- **Training**: A GPU with enough VRAM to load batch sizes of >64 is ideal; requirements will vary depending on the number of image observations, etc.

### Batch Size Recommendations

- **Minimum**: 64 (less than this may result in unstable training)
- **Recommended**: 256-320 (best performance, requires a larger GPU)

## Example: Training on a Custom Dataset

Here's a complete example of training on a custom dataset:

```bash
lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
  --eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[320,240] \
  --policy.image_crop_shape=[224,224] \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true \
  --wandb.project=multitask_dit
```

## References

For more details on the technical implementation and architecture, see:

- [A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation](https://arxiv.org/abs/2507.05331)
- [Large Behavior Models and Atlas Find New Footing](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)