add more descriptions and depth to multitask dit tutorial
@@ -93,12 +93,17 @@ Choose between diffusion and flow matching:
--policy.noise_scheduler_type=DDPM \ # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \ # For faster inference
--policy.beta_schedule=squaredcos_cap_v2 \ # Noise schedule type
--policy.prediction_type=epsilon \ # "epsilon" (predict noise) or "sample" (predict clean)
--policy.clip_sample=true \ # Clip samples during denoising
--policy.clip_sample_range=1.0 # Clipping range [-x, x]

# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \ # or "uniform"; beta sampling appears to perform much better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \ # or "rk4"
--policy.sigma_min=0.0 # Minimum noise in flow interpolation path
```
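
For intuition, here is a minimal, self-contained sketch of how `sigma_min`, `num_integration_steps`, and `integration_method=euler` typically enter a flow-matching policy (illustrative only, with assumed shapes and a toy stand-in model, not LeRobot's actual implementation): training interpolates between noise and the ground-truth action chunk, and inference integrates the learned velocity field from noise to an action.

```python
import torch

def flow_matching_training_pair(actions, sigma_min=0.0):
    """Build the interpolated sample and velocity target for one training step."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)                        # uniform here; the "beta" strategy draws t from a Beta distribution instead
    x_t = (1.0 - (1.0 - sigma_min) * t) * noise + t * actions  # interpolation path
    v_target = actions - (1.0 - sigma_min) * noise             # velocity the network should predict
    return x_t, t, v_target

def euler_sample(velocity_model, action_dim, num_integration_steps=100):
    """Inference: integrate dx/dt = v(x, t) from noise (t=0) to an action (t=1)."""
    x = torch.randn(1, action_dim)
    dt = 1.0 / num_integration_steps
    for i in range(num_integration_steps):
        t = torch.full((1, 1), i * dt)
        x = x + dt * velocity_model(x, t)                      # "rk4" would take 4 model evaluations per step
    return x

x_t, t, v_target = flow_matching_training_pair(torch.randn(4, 6))
print(x_t.shape, v_target.shape)                               # torch.Size([4, 6]) torch.Size([4, 6])

toy_velocity = lambda x, t: -x                                 # toy stand-in for the trained DiT
print(euler_sample(toy_velocity, action_dim=6).shape)          # torch.Size([1, 6])
```
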
#### Transformer Architecture
@@ -108,29 +113,67 @@ Adjust model capacity based on dataset size:
```bash
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64

# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64

# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
```
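
As a rough, assumed back-of-envelope check (the standard ~12·hidden_dim² parameters per transformer block; the real DiT blocks also carry conditioning layers, so treat this as an order-of-magnitude guide), here is how the presets trade off capacity, along with the head-count rule of thumb:

```python
def approx_dit_params(num_layers, hidden_dim):
    num_heads = hidden_dim // 64          # rule of thumb: keep each attention head ~64-dimensional
    per_block = 12 * hidden_dim ** 2      # ~4*d^2 for Q/K/V/out projections + ~8*d^2 for a 4x MLP
    return num_heads, num_layers * per_block

for name, layers in [("small", 4), ("medium", 6), ("large", 8)]:
    heads, params = approx_dit_params(layers, hidden_dim=512)
    print(f"{name}: {layers} layers, {heads} heads, ~{params / 1e6:.1f}M params in the DiT trunk")
```
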
**Positional Encoding Options:**

The model supports two positional encoding methods for action sequences:

```bash
# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0 # Base frequency for RoPE

# Absolute positional encoding
--policy.use_positional_encoding=true # Disables RoPE when true
```
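
For intuition about `rope_base`, here is a minimal RoPE sketch (illustrative only; the exact implementation in the policy may differ): features are split into pairs and rotated by position-dependent angles whose frequency spectrum is set by the base, so attention scores depend on relative offsets between action steps rather than absolute indices.

```python
import torch

def rope_rotate(x, positions, rope_base=10000.0):
    """x: (seq, dim) with even dim; rotate feature pairs by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = rope_base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,) frequencies
    angles = positions[:, None].float() * inv_freq[None, :]          # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 512)                       # 16 action steps, hidden_dim=512
print(rope_rotate(q, torch.arange(16)).shape)  # torch.Size([16, 512])
```
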
**Other Transformer Parameters:**

```bash
--policy.dropout=0.1 # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256 # Timestep embedding dimension
```
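
`timestep_embed_dim` sizes the embedding of the diffusion/flow timestep. As a sketch, most diffusion-style policies use the standard sinusoidal recipe below (assumed here for illustration, not necessarily the exact code path):

```python
import math
import torch

def timestep_embedding(t, timestep_embed_dim=256, max_period=10000.0):
    """Map integer/float timesteps to sinusoidal features of width timestep_embed_dim."""
    half = timestep_embed_dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half).float() / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([args.cos(), args.sin()], dim=-1)  # (batch, timestep_embed_dim)

print(timestep_embedding(torch.tensor([0, 10, 99])).shape)  # torch.Size([3, 256])
```
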
#### Vision Encoder Configuration

```bash
# Use a different CLIP model for more expressivity at the cost of inference time;
# experiment with larger or smaller models depending on the complexity of your tasks and the size of your dataset
--policy.vision_encoder_name=openai/clip-vit-large-patch14

# Use a separate vision encoder per camera
# This may be useful when cameras have significantly different characteristics, but
# be wary of the increased VRAM footprint.
--policy.use_separate_encoder_per_camera=true

# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \ # you may need to resize your images for inference speed-ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true # Random during training, center at inference
```
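
As a small sketch of the behavior `image_crop_is_random` describes (random crops as augmentation during training, a deterministic center crop at inference), using torchvision purely for illustration with an assumed input resolution:

```python
import torch
from torchvision import transforms

train_crop = transforms.RandomCrop((224, 224))   # jitters the crop location every step
eval_crop = transforms.CenterCrop((224, 224))    # deterministic, keeps inference stable

frame = torch.rand(3, 240, 320)                  # a resized camera frame (C, H, W); resolution assumed
print(train_crop(frame).shape, eval_crop(frame).shape)  # both torch.Size([3, 224, 224])
```
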
#### Text Encoder Configuration

```bash
# Use a different CLIP text encoder model
# Same as vision: experiment with larger or smaller models depending on the
# complexity of your tasks and the size of your dataset
--policy.text_encoder_name=openai/clip-vit-large-patch14
```
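
If you are weighing a larger CLIP checkpoint, one quick sanity check (a sketch assuming `transformers` is installed and the Hub is reachable) is to inspect the text and vision feature widths it would feed the policy before paying the extra inference cost:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("openai/clip-vit-large-patch14")
print("text hidden size:  ", cfg.text_config.hidden_size)
print("vision hidden size:", cfg.vision_config.hidden_size)
```
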
#### Learning Rate Configuration

The vision encoder uses a separate learning rate multiplier; 1/10th of the base learning rate is suggested as the ideal starting point:
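
The flag itself falls outside this excerpt, but conceptually (a sketch with assumed module names, not the policy's exact optimizer setup) the multiplier amounts to giving the vision encoder its own optimizer parameter group at one tenth of the base learning rate:

```python
import torch

base_lr = 1e-4
policy = torch.nn.ModuleDict({
    "vision_encoder": torch.nn.Linear(8, 8),  # stand-ins for the real sub-modules
    "dit": torch.nn.Linear(8, 8),
})
optimizer = torch.optim.AdamW([
    {"params": policy["vision_encoder"].parameters(), "lr": base_lr * 0.1},  # 1/10th multiplier
    {"params": policy["dit"].parameters(), "lr": base_lr},
])
print([group["lr"] for group in optimizer.param_groups])  # vision encoder runs at one tenth of the base lr
```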