Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-27 14:39:43 +00:00)
docs: update document in response to Simplify configs PR (#1596)
* docs: update document input/output_shapes -> input/output_features
* fix inconsistent quote (suggested by copilot reviewer)
* docs: shapes => PolicyFeature
* docs: relfect normalization_mapping and remove outdated
@@ -45,12 +45,12 @@ class PreTrainedConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC): # type: igno
 Args:
 n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
 current step and additional steps going back).
-input_shapes: A dictionary defining the shapes of the input data for the policy.
-output_shapes: A dictionary defining the shapes of the output data for the policy.
-input_normalization_modes: A dictionary with key representing the modality and the value specifies the
-normalization mode to apply.
-output_normalization_modes: Similar dictionary as `input_normalization_modes`, but to unnormalize to
-the original scale.
+input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
 """

 n_obs_steps: int = 1
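To make the new docstring concrete, here is a minimal sketch of what `input_features`, `output_features`, and `normalization_mapping` might look like. The three classes below are local stand-ins written for this example, not imports from lerobot; the `ACTION` feature type, the `MEAN_STD` mode, and the helper `normalization_mode_for` are assumptions added for illustration. The shapes `(3, 96, 96)` and `(14,)` are taken from the examples in the removed `input_shapes` / `output_shapes` docstrings.

```python
from dataclasses import dataclass
from enum import Enum


# Stand-in types for illustration only; they mirror the names used in the
# docstring (FeatureType, PolicyFeature, NormalizationMode) but are defined here.
class FeatureType(str, Enum):
    STATE = "STATE"
    VISUAL = "VISUAL"
    ACTION = "ACTION"  # assumption: not named in the docstring


class NormalizationMode(str, Enum):
    MIN_MAX = "MIN_MAX"
    MEAN_STD = "MEAN_STD"  # assumption: taken from the removed "mean_std" mode


@dataclass
class PolicyFeature:
    type: FeatureType
    shape: tuple  # per-sample shape, without batch or temporal dimensions


# Keys are data names; values describe the feature type and shape.
input_features = {
    "observation.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 96, 96)),
    "observation.state": PolicyFeature(type=FeatureType.STATE, shape=(14,)),
}
output_features = {
    "action": PolicyFeature(type=FeatureType.ACTION, shape=(14,)),
}

# Keys are the str values of FeatureType, so one entry covers every feature
# of that type instead of one entry per data name.
normalization_mapping = {
    "VISUAL": NormalizationMode.MEAN_STD,
    "STATE": NormalizationMode.MIN_MAX,
    "ACTION": NormalizationMode.MIN_MAX,
}


def normalization_mode_for(key, features):
    """Hypothetical helper: look up the mode for one named feature."""
    return normalization_mapping[features[key].type.value]


print(normalization_mode_for("observation.state", input_features))
```

The practical difference from the old per-key `input_normalization_modes` is that the mode is chosen by feature type, so adding a second camera needs no new normalization entry.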
@@ -28,7 +28,7 @@ class ACTConfig(PreTrainedConfig):
 Defaults are configured for training on bimanual Aloha tasks like "insertion" or "transfer".

 The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-Those are: `input_shapes` and 'output_shapes`.
+Those are: `input_features` and `output_features`.

 Notes on the inputs and outputs:
 - Either:
@@ -48,21 +48,12 @@ class ACTConfig(PreTrainedConfig):
 This should be no greater than the chunk size. For example, if the chunk size size 100, you may
 set this to 50. This would mean that the model predicts 100 steps worth of actions, runs 50 in the
 environment, and throws the other 50 out.
-input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-the input data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-include batch dimension or temporal dimension.
-output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-the output data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-[-1, 1] range.
-output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-original scale. Note that this is also used for normalizing the training targets.
+input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
 vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
 pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
 `None` means no pretrained weights.
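The context lines of this hunk describe how the number of executed action steps relates to the chunk size: the model predicts a full chunk, only the first steps are run, and the rest are thrown out. A rough sketch of that bookkeeping follows; the function name `split_chunk` and the use of plain integers as stand-in actions are assumptions for illustration, not the ACT policy's code.

```python
def split_chunk(predicted_actions, n_action_steps):
    """Split one predicted chunk into the part that is run and the part that is dropped.

    Hypothetical helper: `predicted_actions` is any sequence with one entry per
    predicted step; `n_action_steps` must be no greater than the chunk size.
    """
    chunk_size = len(predicted_actions)
    if n_action_steps > chunk_size:
        raise ValueError(
            f"n_action_steps ({n_action_steps}) must be no greater than "
            f"the chunk size ({chunk_size})"
        )
    executed = predicted_actions[:n_action_steps]
    discarded = predicted_actions[n_action_steps:]
    return executed, discarded


# Chunk size 100, run 50 in the environment, throw the other 50 out.
chunk = list(range(100))  # integers standing in for action vectors
executed, discarded = split_chunk(chunk, 50)
print(len(executed), len(discarded))
```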
@@ -30,7 +30,7 @@ class DiffusionConfig(PreTrainedConfig):
 Defaults are configured for training with PushT providing proprioceptive and single camera observations.

 The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-Those are: `input_shapes` and `output_shapes`.
+Those are: `input_features` and `output_features`.

 Notes on the inputs and outputs:
 - "observation.state" is required as an input key.
@@ -48,21 +48,12 @@ class DiffusionConfig(PreTrainedConfig):
 horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
 n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
 See `DiffusionPolicy.select_action` for more details.
-input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-the input data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-include batch dimension or temporal dimension.
-output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-the output data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-[-1, 1] range.
-output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-original scale. Note that this is also used for normalizing the training targets.
+input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
 vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
 crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
 within the image size. If None, no cropping is done.
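The removed docstring described the two normalization modes in words: "mean_std" subtracts the mean and divides by the standard deviation, and "min_max" rescales into a [-1, 1] range, with the output side undoing the transform. The scalar sketch below spells out that arithmetic. The function names and the use of plain floats are assumptions for illustration; the library's own normalization operates on dataset statistics and tensors and may differ in detail.

```python
def mean_std_normalize(x, mean, std):
    # Subtract the mean and divide by the standard deviation.
    return (x - mean) / std


def mean_std_unnormalize(y, mean, std):
    # Inverse of mean_std_normalize: back to the original scale.
    return y * std + mean


def min_max_normalize(x, lo, hi):
    # Map [lo, hi] to [0, 1], then stretch and shift to [-1, 1].
    return 2 * (x - lo) / (hi - lo) - 1


def min_max_unnormalize(y, lo, hi):
    # Inverse of min_max_normalize: map [-1, 1] back to [lo, hi].
    return (y + 1) / 2 * (hi - lo) + lo


# The endpoints of the data range land on the endpoints of [-1, 1].
print(min_max_normalize(0.0, 0.0, 10.0), min_max_normalize(10.0, 0.0, 10.0))
```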
@@ -30,7 +30,7 @@ class TDMPCConfig(PreTrainedConfig):
 camera observations.

 The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-Those are: `input_shapes`, `output_shapes`, and perhaps `max_random_shift_ratio`.
+Those are: `input_features`, `output_features`, and perhaps `max_random_shift_ratio`.

 Args:
 n_action_repeats: The number of times to repeat the action returned by the planning. (hint: Google
@@ -40,24 +40,12 @@ class TDMPCConfig(PreTrainedConfig):
 is an alternative to using action repeats. If this is set to more than 1, then we require
 `n_action_repeats == 1`, `use_mpc == True` and `n_action_steps <= horizon`. Note that this
 approach of using multiple steps from the plan is not in the original implementation.
-input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
-the input data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
-indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
-include batch dimension or temporal dimension.
-output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
-the output data name, and the value is a list indicating the dimensions of the corresponding data.
-For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
-Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
-input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-[-1, 1] range. Note that here this defaults to None meaning inputs are not normalized. This is to
-match the original implementation.
-output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-original scale. Note that this is also used for normalizing the training targets. NOTE: Clipping
-to [-1, +1] is used during MPPI/CEM. Therefore, it is recommended that you stick with "min_max"
-normalization mode here.
+input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
 image_encoder_hidden_dim: Number of channels for the convolutional layers used for image encoding.
 state_encoder_hidden_dim: Hidden dimension for MLP used for state vector encoding.
 latent_dim: Observation's latent embedding dimension.
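The context lines of this hunk state a constraint: when `n_action_steps` is more than 1, the config requires `n_action_repeats == 1`, `use_mpc == True` and `n_action_steps <= horizon`. A minimal sketch of such a check is shown below. The class `PlanStepSettings` and its default values are hypothetical and exist only for this example; only the four field names and the three conditions come from the docstring.

```python
from dataclasses import dataclass


@dataclass
class PlanStepSettings:
    """Hypothetical container for the four fields named in the docstring."""

    n_action_repeats: int = 1  # default chosen for illustration
    horizon: int = 5  # default chosen for illustration
    n_action_steps: int = 1
    use_mpc: bool = True

    def __post_init__(self):
        # The constraints only apply when more than one plan step is executed.
        if self.n_action_steps > 1:
            if self.n_action_repeats != 1:
                raise ValueError("n_action_steps > 1 requires n_action_repeats == 1")
            if not self.use_mpc:
                raise ValueError("n_action_steps > 1 requires use_mpc == True")
            if self.n_action_steps > self.horizon:
                raise ValueError("n_action_steps must be <= horizon")


# Accepted: three plan steps, no action repeats, MPC on, within the horizon.
settings = PlanStepSettings(n_action_steps=3, horizon=5)
print(settings)
```

Putting the check in `__post_init__` means an invalid combination fails when the settings object is built rather than later during planning.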
@@ -32,7 +32,7 @@ class VQBeTConfig(PreTrainedConfig):
 Defaults are configured for training with PushT providing proprioceptive and single camera observations.

 The parameters you will most likely need to change are the ones which depend on the environment / sensors.
-Those are: `input_shapes` and `output_shapes`.
+Those are: `input_features` and `output_features`.

 Notes on the inputs and outputs:
 - "observation.state" is required as an input key.
@@ -46,21 +46,12 @@ class VQBeTConfig(PreTrainedConfig):
 current step and additional steps going back).
 n_action_pred_token: Total number of current token and future tokens that VQ-BeT predicts.
 action_chunk_size: Action chunk size of each action prediction token.
-input_shapes: A dictionary defining the shapes of the input data for the policy.
-The key represents the input data name, and the value is a list indicating the dimensions
-of the corresponding data. For example, "observation.image" refers to an input from
-a camera with dimensions [3, 96, 96], indicating it has three color channels and 96x96 resolution.
-Importantly, shapes doesnt include batch dimension or temporal dimension.
-output_shapes: A dictionary defining the shapes of the output data for the policy.
-The key represents the output data name, and the value is a list indicating the dimensions
-of the corresponding data. For example, "action" refers to an output shape of [14], indicating
-14-dimensional actions. Importantly, shapes doesnt include batch dimension or temporal dimension.
-input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
-and the value specifies the normalization mode to apply. The two available modes are "mean_std"
-which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
-[-1, 1] range.
-output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
-original scale. Note that this is also used for normalizing the training targets.
+input_features: A dictionary defining the PolicyFeature of the input data for the policy. The key represents
+the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+output_features: A dictionary defining the PolicyFeature of the output data for the policy. The key represents
+the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
+normalization_mapping: A dictionary that maps from a str value of FeatureType (e.g., "STATE", "VISUAL") to
+a corresponding NormalizationMode (e.g., NormalizationMode.MIN_MAX)
 vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
 crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
 within the image size. If None, no cropping is done.