docs: update document in response to Simplify configs PR (#1596)

* docs: update document input/output_shapes -> input/output_features
* fix inconsistent quote (suggested by copilot reviewer)
* docs: shapes => PolicyFeature
* docs: reflect normalization_mapping and remove outdated
2026-07-24 10:16:09 +00:00 · 2026-02-03 04:05:58 +09:00
parent b18cef2e26
commit 9c24a09665
5 changed files with 34 additions and 73 deletions
@@ -45,12 +45,12 @@ class PreTrainedConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):  # type: igno
    Args:
        n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
            current step and additional steps going back).
-        input_shapes: A dictionary defining the shapes of the input data for the policy.
-        output_shapes: A dictionary defining the shapes of the output data for the policy.
-        input_normalization_modes: A dictionary with key representing the modality and the value specifies the
-            normalization mode to apply.
-        output_normalization_modes: Similar dictionary as `input_normalization_modes`, but to unnormalize to
-            the original scale.
+        input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+            the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+            the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+            a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
    """
    n_obs_steps: int = 1
@@ -28,7 +28,7 @@ class ACTConfig(PreTrainedConfig):
    Defaults are configured for training on bimanual Aloha tasks like "insertion" or "transfer".
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-    Those are: `input_shapes` and 'output_shapes`.
+    Those are: `input_features` and `output_features`.
    Notes on the inputs and outputs:
        - Either:
@@ -48,21 +48,12 @@ class ACTConfig(PreTrainedConfig):
            This should be no greater than the chunk size. For example, if the chunk size size 100, you may
            set this to 50. This would mean that the model predicts 100 steps worth of actions, runs 50 in the
            environment, and throws the other 50 out.
-        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-            the input data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-            include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-            the output data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-            [-1, 1] range.
-        output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-            original scale. Note that this is also used for normalizing the training targets.
+        input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+            the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+            the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+            a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
        vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
        pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
            `None` means no pretrained weights.
@@ -30,7 +30,7 @@ class DiffusionConfig(PreTrainedConfig):
    Defaults are configured for training with PushT providing proprioceptive and single camera observations.
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-    Those are: `input_shapes` and `output_shapes`.
+    Those are: `input_features` and `output_features`.
    Notes on the inputs and outputs:
        - "observation.state" is required as an input key.
@@ -48,21 +48,12 @@ class DiffusionConfig(PreTrainedConfig):
        horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
        n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
            See `DiffusionPolicy.select_action` for more details.
-        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-            the input data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-            include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-            the output data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-            [-1, 1] range.
-        output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-            original scale. Note that this is also used for normalizing the training targets.
+        input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+            the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+            the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+            a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
        vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
        crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
            within the image size. If None, no cropping is done.
@@ -30,7 +30,7 @@ class TDMPCConfig(PreTrainedConfig):
    camera observations.
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-    Those are: `input_shapes`, `output_shapes`, and perhaps `max_random_shift_ratio`.
+    Those are: `input_features`, `output_features`, and perhaps `max_random_shift_ratio`.
    Args:
        n_action_repeats: The number of times to repeat the action returned by the planning. (hint: Google
@@ -40,24 +40,12 @@ class TDMPCConfig(PreTrainedConfig):
            is an alternative to using action repeats. If this is set to more than 1, then we require
            `n_action_repeats == 1`, `use_mpc == True` and `n_action_steps <= horizon`. Note that this
            approach of using multiple steps from the plan is not in the original implementation.
-        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-            the input data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-            include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-            the output data name, and the value is a list indicating the dimensions of the corresponding data.
-            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-            [-1, 1] range. Note that here this defaults to None meaning inputs are not normalized. This is to
-            match the original implementation.
-        output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-            original scale. Note that this is also used for normalizing the training targets. NOTE: Clipping
-            to [-1, +1] is used during MPPI/CEM. Therefore, it is recommended that you stick with "min_max"
-            normalization mode here.
+        input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+            the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+            the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+            a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
        image_encoder_hidden_dim: Number of channels for the convolutional layers used for image encoding.
        state_encoder_hidden_dim: Hidden dimension for MLP used for state vector encoding.
        latent_dim: Observation's latent embedding dimension.
@@ -32,7 +32,7 @@ class VQBeTConfig(PreTrainedConfig):
    Defaults are configured for training with PushT providing proprioceptive and single camera observations.
    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-    Those are: `input_shapes` and `output_shapes`.
+    Those are: `input_features` and `output_features`.
    Notes on the inputs and outputs:
        - "observation.state" is required as an input key.
@@ -46,21 +46,12 @@ class VQBeTConfig(PreTrainedConfig):
            current step and additional steps going back).
        n_action_pred_token: Total number of current token and future tokens that VQ-BeT predicts.
        action_chunk_size: Action chunk size of each action prediction token.
-        input_shapes: A dictionary defining the shapes of the input data for the policy.
-            The key represents the input data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "observation.image" refers to an input from
-            a camera with dimensions [3, 96, 96], indicating it has three color channels and 96x96 resolution.
-            Importantly, shapes doesnt include batch dimension or temporal dimension.
-        output_shapes: A dictionary defining the shapes of the output data for the policy.
-            The key represents the output data name, and the value is a list indicating the dimensions
-            of the corresponding data. For example, "action" refers to an output shape of [14], indicating
-            14-dimensional actions. Importantly, shapes doesnt include batch dimension or temporal dimension.
-        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-            which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-            [-1, 1] range.
-        output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-            original scale. Note that this is also used for normalizing the training targets.
+        input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+            the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+            the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+        normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+            a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
        vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
        crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
            within the image size. If None, no cropping is done.
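
Illustration (not part of the commit): the updated docstrings describe `input_features` and `output_features` as dictionaries from a data name to a PolicyFeature, and `normalization_mapping` as a dictionary from a FeatureType string to a NormalizationMode. The sketch below shows what such values might look like. The three classes are local stand-ins written for this example, not the project's own definitions, and the "ACTION" feature type, the MEAN_STD member, and the `normalization_for` helper are assumptions; only "STATE", "VISUAL", and NormalizationMode.MIN_MAX are named in the docstrings. The shapes reuse the examples from the removed text ([3, 96, 96] for a camera image, [14] for actions) and, as that text noted, exclude batch and temporal dimensions.

```python
from dataclasses import dataclass
from enum import Enum


# Stand-in types for illustration only; the project's real classes may differ.
class FeatureType(str, Enum):
    STATE = "STATE"
    VISUAL = "VISUAL"
    ACTION = "ACTION"  # assumed member, not named in the docstrings


class NormalizationMode(str, Enum):
    MIN_MAX = "MIN_MAX"
    MEAN_STD = "MEAN_STD"  # assumed member, mirroring the old "mean_std" mode


@dataclass(frozen=True)
class PolicyFeature:
    type: FeatureType
    shape: tuple  # no batch or temporal dimension


# Key: data name. Value: PolicyFeature (feature type plus shape).
input_features = {
    "observation.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 96, 96)),
    "observation.state": PolicyFeature(type=FeatureType.STATE, shape=(14,)),
}
output_features = {
    "action": PolicyFeature(type=FeatureType.ACTION, shape=(14,)),
}

# Key: str value of a FeatureType. Value: the normalization mode for that type.
normalization_mapping = {
    "VISUAL": NormalizationMode.MEAN_STD,
    "STATE": NormalizationMode.MIN_MAX,
    "ACTION": NormalizationMode.MIN_MAX,
}


def normalization_for(feature: PolicyFeature) -> NormalizationMode:
    """Hypothetical helper: look up the mode by the feature's type string."""
    return normalization_mapping[feature.type.value]
```

Compared with the removed `input_normalization_modes` and `output_normalization_modes`, which were keyed by modality name, a mapping keyed by feature type needs one entry per type rather than one per data name.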