diff --git a/docs/source/multitask_dit.mdx b/docs/source/multitask_dit.mdx
index 1c32c95b8..3bb1095d7 100644
--- a/docs/source/multitask_dit.mdx
+++ b/docs/source/multitask_dit.mdx
@@ -93,12 +93,17 @@ Choose between diffusion and flow matching:
 --policy.noise_scheduler_type=DDPM \ # or "DDIM"
 --policy.num_train_timesteps=100 \
 --policy.num_inference_steps=10 \ # For faster inference
+--policy.beta_schedule=squaredcos_cap_v2 \ # Noise schedule type
+--policy.prediction_type=epsilon \ # "epsilon" (predict noise) or "sample" (predict clean)
+--policy.clip_sample=true \ # Clip samples during denoising
+--policy.clip_sample_range=1.0 # Clipping range [-x, x]
 
 # Flow matching objective
 --policy.objective=flow_matching \
 --policy.timestep_sampling_strategy=beta \ # or "uniform"; beta sampling tends to perform better in practice
 --policy.num_integration_steps=100 \
 --policy.integration_method=euler \ # or "rk4"
+--policy.sigma_min=0.0 # Minimum noise in the flow interpolation path
 ```
 
 #### Transformer Architecture
@@ -108,29 +113,67 @@ Adjust model capacity based on dataset size:
 ```bash
 # Small datasets (< 100 examples)
 --policy.num_layers=4 \
---policy.hidden_dim=512
+--policy.hidden_dim=512 \
+--policy.num_heads=8 # should ideally be hidden_dim // 64
 
 # Medium datasets (100-5k examples) - default
 --policy.num_layers=6 \
---policy.hidden_dim=512
+--policy.hidden_dim=512 \
+--policy.num_heads=8 # should ideally be hidden_dim // 64
 
 # Large datasets (> 5k examples)
 --policy.num_layers=8 \
---policy.hidden_dim=512
+--policy.hidden_dim=512 \
+--policy.num_heads=8 # should ideally be hidden_dim // 64
+```
+
+**Positional Encoding Options:**
+
+The model supports two positional encoding methods for action sequences:
+
+```bash
+# Rotary Position Embedding (RoPE) - default, recommended
+--policy.use_rope=true \
+--policy.rope_base=10000.0 # Base frequency for RoPE
+
+# Absolute positional encoding
+--policy.use_positional_encoding=true # Disables RoPE when true
+```
+
+**Other Transformer Parameters:**
+
+```bash
+--policy.dropout=0.1 # Dropout rate for DiT blocks (0.0-1.0)
+--policy.timestep_embed_dim=256 # Timestep embedding dimension
 ```
 
 #### Vision Encoder Configuration
 
 ```bash
 # Use different CLIP model for more expressivity at the cost of inference time
+# Experiment with larger or smaller models depending on the complexity of your tasks and the size of your dataset
 --policy.vision_encoder_name=openai/clip-vit-large-patch14
 
+# Use a separate vision encoder per camera
+# This may be useful when cameras have significantly different characteristics, but
+# be wary of the increased VRAM footprint.
+--policy.use_separate_encoder_per_camera=true
+
 # Image preprocessing
 --policy.image_resize_shape=[XXX,YYY] \ # you may need to resize your images for inference speed ups
 --policy.image_crop_shape=[224,224] \
 --policy.image_crop_is_random=true # Random during training, center at inference
 ```
 
+#### Text Encoder Configuration
+
+```bash
+# Use a different CLIP text encoder model
+# Same as vision: experiment with larger or smaller models depending on the
+# complexity of your tasks and the size of your dataset
+--policy.text_encoder_name=openai/clip-vit-large-patch14
+```
+
 #### Learning Rate Configuration
 
 The vision encoder uses a separate learning rate multiplier, where 1/10th is suggested as the ideal starting point:
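Reviewer note on the `--policy.sigma_min` flag added above: in the flow-matching literature, `sigma_min` is typically the amount of noise retained at the action endpoint of the noise-to-action interpolation path. A minimal sketch assuming the standard optimal-transport conditional path convention (the policy's actual implementation may differ):

```python
def interpolate(noise, action, t, sigma_min=0.0):
    """Point on the noise-to-action path at time t in [0, 1].

    Assumed convention: x_t = (1 - (1 - sigma_min) * t) * noise + t * action.
    With sigma_min = 0 the path runs from pure noise at t=0 to the clean
    action at t=1; a nonzero sigma_min keeps that fraction of noise at t=1.
    """
    return (1.0 - (1.0 - sigma_min) * t) * noise + t * action

print(interpolate(5.0, 2.0, 0.0))  # 5.0 (pure noise at t=0)
print(interpolate(5.0, 2.0, 1.0))  # 2.0 (clean action at t=1 when sigma_min=0)
```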
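Likewise, `--policy.integration_method=euler` together with `--policy.num_integration_steps` controls how the learned velocity field is integrated from noise to an action at inference time. A toy illustration of fixed-step Euler integration, not the policy's actual code:

```python
def euler_integrate(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: under a constant velocity field (the conditional flow-matching
# target for a straight path), Euler lands on the endpoint.
start, target = 0.2, 1.0
result = euler_integrate(lambda x, t: target - start, start, 100)
print(round(result, 6))  # 1.0
```

More steps reduce discretization error for non-constant velocity fields, which is the trade-off behind choosing `euler` (cheap, first-order) versus `rk4` (fourth-order, four velocity evaluations per step).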