# ACT (Action Chunking with Transformers)
ACT is a **lightweight and efficient policy for imitation learning**, especially well-suited for fine-grained manipulation tasks. It's the **first model we recommend when you're starting out** with LeRobot due to its fast training time, low computational requirements, and strong performance.

<div class="video-container">
  <iframe
    width="100%"
    height="415"
    src="https://www.youtube.com/embed/ft73x0LfGpM"
    title="LeRobot ACT Tutorial"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen
  ></iframe>
</div>

_Watch this tutorial from the LeRobot team to learn how ACT works: [LeRobot ACT Tutorial](https://www.youtube.com/watch?v=ft73x0LfGpM)_
## Model Overview
Action Chunking with Transformers (ACT) was introduced in the paper [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705) by Zhao et al. The policy was designed to enable precise, contact-rich manipulation tasks using affordable hardware and minimal demonstration data.
### Why ACT is Great for Beginners
ACT stands out as an excellent starting point for several reasons:

- **Fast Training**: Trains in a few hours on a single GPU
- **Lightweight**: Only ~80M parameters, making it efficient and easy to work with
- **Data Efficient**: Often achieves high success rates with just 50 demonstrations
### Architecture
ACT uses a transformer-based architecture with three main components:

1. **Vision Backbone**: A ResNet-18 processes images from multiple camera viewpoints
2. **Transformer Encoder**: Synthesizes information from camera features, joint positions, and a learned latent variable
3. **Transformer Decoder**: Generates coherent action sequences using cross-attention

The policy takes as input:

- Multiple RGB images (e.g., from wrist cameras, front/top cameras)
- Current robot joint positions
- A latent style variable `z` (learned during training, set to zero during inference)

And outputs a chunk of `k` future actions.
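
A helpful way to see what chunking buys you: the policy predicts `k` actions per forward pass, and at inference each timestep can average all the chunks that overlap it (the paper's temporal ensembling). Below is a minimal Python sketch of that loop; `get_observation`, `predict_chunk`, and `apply_action` are hypothetical stubs standing in for real robot and policy calls, so treat this as an illustration of the idea rather than LeRobot's implementation.

```python
import numpy as np

k, action_dim, horizon = 100, 14, 300  # chunk size, action dims, episode length
m = 0.01  # ensembling decay: w_i = exp(-m * i), where i = 0 is the oldest prediction

def get_observation():   # hypothetical stand-in for robot sensor I/O
    return None

def predict_chunk(obs):  # hypothetical stand-in for the policy forward pass
    return np.zeros((k, action_dim))

def apply_action(action):  # hypothetical stand-in for sending motor commands
    pass

# buffer[t] collects every prediction ever made for timestep t
buffer = [[] for _ in range(horizon + k)]

for t in range(horizon):
    chunk = predict_chunk(get_observation())  # shape (k, action_dim)
    for i in range(k):                        # file each action under its target step
        buffer[t + i].append(chunk[i])

    preds = np.stack(buffer[t])                   # all predictions for the current step
    weights = np.exp(-m * np.arange(len(preds)))  # oldest prediction gets weight 1
    apply_action(weights @ preds / weights.sum())
```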
## Installation Requirements
1. Install LeRobot by following our [Installation Guide](./installation).
2. ACT is included in the base LeRobot installation, so no additional dependencies are needed!
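
To confirm ACT is available in your install, a one-line import check is enough. The import path below assumes a recent LeRobot release; older versions placed policies under `lerobot.common.policies`.

```python
# Quick sanity check that ACT shipped with the base install
# (path assumes a recent LeRobot release).
from lerobot.policies.act.modeling_act import ACTPolicy

print(ACTPolicy.name)  # "act"
```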
## Training ACT
ACT works seamlessly with the standard LeRobot training pipeline. Here's a complete example for training ACT on your dataset:

```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/your_dataset \
  --policy.type=act \
  --output_dir=outputs/train/act_your_dataset \
  --job_name=act_your_dataset \
  --policy.device=cuda \
  --wandb.enable=true \
  --policy.repo_id=${HF_USER}/act_policy
```
### Training Tips
1. **Start with defaults**: ACT's default hyperparameters work well for most tasks
2. **Training duration**: Expect a few hours for 100k training steps on a single GPU
3. **Batch size**: Start with batch size 8 and adjust based on your GPU memory
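
If you want to see exactly which defaults you are starting from before overriding them on the CLI, you can inspect ACT's config object from Python. This is a sketch: the import path follows recent LeRobot releases and may differ in your version.

```python
# Inspect ACT's default hyperparameters; each field maps to a CLI override
# such as --policy.chunk_size=... (import path assumes a recent LeRobot release).
from dataclasses import asdict

from lerobot.policies.act.configuration_act import ACTConfig

cfg = ACTConfig()
print(cfg.chunk_size)      # number of actions predicted per forward pass
print(cfg.n_action_steps)  # actions executed before querying the policy again
print(asdict(cfg))         # full view of every tunable field
```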
### Train using Google Colab
If your local machine doesn't have a powerful GPU, you can train on Google Colab by following the [ACT training notebook](./notebooks#training-act).
## Evaluating ACT
Once training is complete, you can evaluate your ACT policy with the `lerobot-record` command, pointing `--policy.path` at your trained checkpoint. This runs inference on the robot and records the evaluation episodes:

```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_robot \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --display_data=true \
  --dataset.repo_id=${HF_USER}/eval_act_your_dataset \
  --dataset.num_episodes=10 \
  --dataset.single_task="Your task description" \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
  --policy.path=${HF_USER}/act_policy
  # Optional: --dataset.camera_encoder.vcodec=auto
```
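
You can also sanity-check a checkpoint offline before putting it on the robot. The sketch below loads a trained policy and runs one inference step on dummy tensors; the import path, 6-dim state, and `observation.images.front` key are assumptions (a recent LeRobot release and an SO-100 setup with one `front` camera) and must match your training configuration.

```python
# Offline smoke test for a trained ACT checkpoint (a sketch: import path,
# state dimension, and image key are assumptions that must match your setup).
import torch

from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("<hf_user>/act_policy")
policy.eval()

batch = {
    "observation.state": torch.zeros(1, 6),                   # joint positions
    "observation.images.front": torch.zeros(1, 3, 480, 640),  # RGB in [0, 1]
}
with torch.no_grad():
    action = policy.select_action(batch)
print(action.shape)  # (1, action_dim): the next action from the current chunk
```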