mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-15 16:49:55 +00:00
# EVO1
EVO1 is a Vision-Language-Action policy for robot control built around an InternVL3 backbone and a continuous flow-matching action head. This LeRobot integration exposes EVO1 as a standard policy type so it can be trained and evaluated with the usual LeRobot dataset, checkpoint, and processor APIs.

## Model Overview

The policy embeds one or more camera images and the language task prompt with InternVL3, pads robot state/action vectors to fixed maximum dimensions, and predicts future action chunks with a flow-matching action head. During inference, the policy samples an action chunk and returns `n_action_steps` actions from that chunk before sampling again.
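The chunked inference loop described above can be sketched as a small action queue: the policy samples a chunk of `chunk_size` actions, serves `n_action_steps` of them one call at a time, and only resamples when the queue runs dry. This is a minimal illustrative sketch, not the LeRobot API; `sample_chunk` stands in for the flow-matching head, and the 7-dimensional dummy actions are an arbitrary choice:

```python
from collections import deque

CHUNK_SIZE = 50      # actions predicted per flow-matching sample
N_ACTION_STEPS = 50  # actions consumed from each chunk before resampling
SAMPLE_CALLS = {"n": 0}  # instrumentation for this example only

def sample_chunk(observation):
    """Stand-in for the flow-matching action head (illustrative only)."""
    SAMPLE_CALLS["n"] += 1
    return [[0.0] * 7 for _ in range(CHUNK_SIZE)]  # dummy 7-DoF actions

class ChunkedPolicy:
    """Serves one action per call, sampling a new chunk when the queue is empty."""

    def __init__(self):
        self._queue = deque()

    def select_action(self, observation):
        if not self._queue:
            # Keep only the first N_ACTION_STEPS actions of the fresh chunk.
            self._queue.extend(sample_chunk(observation)[:N_ACTION_STEPS])
        return self._queue.popleft()

policy = ChunkedPolicy()
actions = [policy.select_action(obs) for obs in range(120)]
```

With `chunk_size = n_action_steps = 50`, 120 control steps trigger three samples of the action head; lowering `n_action_steps` below `chunk_size` trades compute for more frequent replanning.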
### What the LeRobot Integration Covers
- Standard `policy.type=evo1` configuration through LeRobot
- InternVL3 image/text embedding with optional FlashAttention fallback
- Stage-based finetuning controls for action-head-only and VLM finetuning runs
- Continuous flow-matching action prediction
- Checkpoint save/load through LeRobot policy APIs
- Training with `lerobot-train` and evaluation with standard policy inference APIs

The broader EVO1 project may include additional training scripts and dataset tooling. This page focuses on the LeRobot robot-control policy path.
## Installation Requirements
1. Install LeRobot by following the [Installation Guide](./installation).

2. Install EVO1 dependencies:

   ```bash
   pip install -e ".[evo1]"
   ```

3. Install a `flash-attn` wheel only if it is compatible with your Python, PyTorch, CUDA, and GPU stack. EVO1 falls back to standard attention when `flash_attn` is not available.
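The attention fallback in step 3 follows a common guarded-import pattern; a minimal sketch of that pattern (the function name is illustrative, not EVO1's actual code; `"flash_attention_2"` and `"eager"` are the usual `transformers` `attn_implementation` values):

```python
def pick_attn_implementation(use_flash_attn: bool = True) -> str:
    """Prefer FlashAttention when the wheel imports cleanly, else fall back."""
    if use_flash_attn:
        try:
            import flash_attn  # noqa: F401  # only checks availability
            return "flash_attention_2"
        except ImportError:
            pass  # wheel missing or incompatible: fall back silently
    return "eager"

attn_impl = pick_attn_implementation()
```

On a machine without a working `flash-attn` wheel this returns `"eager"`, so disabling `policy.use_flash_attn` is only needed when a broken wheel is installed.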
EVO1 uses InternVL3 through the Hugging Face `transformers` remote-code path, so the first run may download the configured VLM checkpoint unless `policy.vlm_model_name` points to a local model directory.
## Data Requirements
EVO1 expects a LeRobot dataset with:

- One to `policy.max_views` visual observations, for example `observation.images.image`
- `observation.state`
- `action`
- A language task instruction in the dataset `task` field, or another field configured with `policy.task_field`

State and action vectors are padded to `policy.max_state_dim` and `policy.max_action_dim`. Predictions are cropped back to the dataset action dimension before being returned.
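The pad-then-crop behavior can be sketched in plain Python. This assumes right-padding with zeros for illustration (the actual pad value and layout are defined by the implementation), and the function names are hypothetical:

```python
MAX_STATE_DIM = 24   # policy.max_state_dim
MAX_ACTION_DIM = 24  # policy.max_action_dim

def pad_vector(vec, max_dim, pad_value=0.0):
    """Right-pad a state/action vector to the fixed maximum dimension."""
    if len(vec) > max_dim:
        raise ValueError(f"vector dim {len(vec)} exceeds max {max_dim}")
    return list(vec) + [pad_value] * (max_dim - len(vec))

def crop_action(padded_action, action_dim):
    """Crop a padded prediction back to the dataset's action dimension."""
    return padded_action[:action_dim]

# A 7-DoF robot: state/action are padded to 24 dims for the model,
# and predictions are cropped back to 7 dims before being returned.
state = pad_vector([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], MAX_STATE_DIM)
predicted = pad_vector([0.5] * 7, MAX_ACTION_DIM)  # pretend model output
action = crop_action(predicted, action_dim=7)
```

This is why a single EVO1 configuration can serve datasets with different state/action dimensions, as long as they fit within the configured maxima.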
## Usage
To use EVO1 in a LeRobot configuration, specify:

```
policy.type=evo1
```

By default, a new EVO1 policy initializes its VLM from:

```
policy.vlm_model_name=OpenGVLab/InternVL3-1B
```

Once a LeRobot-format EVO1 checkpoint is available, load it with:

```
policy.path=your-org/your-evo1-checkpoint
```
## Training
### Stage 1
Stage 1 freezes the VLM and trains the action head:

```bash
lerobot-train \
  --dataset.repo_id=your_org/your_dataset \
  --policy.type=evo1 \
  --policy.training_stage=stage1 \
  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
  --policy.device=cuda \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=24 \
  --policy.max_action_dim=24 \
  --policy.optimizer_lr=1e-5 \
  --batch_size=4 \
  --steps=5000 \
  --output_dir=./outputs/evo1_stage1
```
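The stage split amounts to standard PyTorch parameter freezing: stage 1 leaves gradients enabled only on the action head, while stage 2 also unfreezes the VLM. A toy sketch under those assumptions (the modules and helper below are illustrative, not EVO1's internals):

```python
import torch.nn as nn

# Toy stand-ins for the two branches of the policy.
vlm = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
action_head = nn.Linear(16, 24)

def apply_training_stage(stage: str) -> None:
    """stage1: freeze the VLM; stage2: train VLM and action head together."""
    train_vlm = stage == "stage2"
    for p in vlm.parameters():
        p.requires_grad = train_vlm
    for p in action_head.parameters():
        p.requires_grad = True  # the action head trains in both stages

apply_training_stage("stage1")
trainable = [p for m in (vlm, action_head)
             for p in m.parameters() if p.requires_grad]
```

Frozen parameters still run in the forward pass (the VLM's embeddings are always used); they simply receive no gradient updates, which is why stage 1 runs are cheaper and can use shorter schedules.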
### Stage 2
Stage 2 finetunes the VLM branches and action head. A common workflow starts from a Stage 1 checkpoint:

```bash
lerobot-train \
  --dataset.repo_id=your_org/your_dataset \
  --policy.path=./outputs/evo1_stage1/checkpoints/005000/pretrained_model \
  --policy.training_stage=stage2 \
  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
  --policy.device=cuda \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=24 \
  --policy.max_action_dim=24 \
  --policy.optimizer_lr=1e-5 \
  --batch_size=4 \
  --steps=80000 \
  --output_dir=./outputs/evo1_stage2
```
### Key Training Parameters
| Parameter | Default | Description |
| --- | --- | --- |
| `policy.vlm_model_name` | `OpenGVLab/InternVL3-1B` | InternVL3 checkpoint or local model directory |
| `policy.training_stage` | `stage1` | `stage1` trains the action head; `stage2` finetunes VLM branches |
| `policy.vlm_num_layers` | `14` | Number of InternVL3 language layers kept for the policy |
| `policy.vlm_dtype` | `bfloat16` | Requested VLM dtype |
| `policy.use_flash_attn` | `true` | Requests FlashAttention when installed; otherwise falls back |
| `policy.enable_gradient_checkpointing` | `true` | Enables checkpointing on supported InternVL3 modules |
| `policy.gradient_checkpointing_use_reentrant` | `false` | Reentrant setting passed to gradient checkpointing when supported |
| `policy.chunk_size` | `50` | Number of future actions predicted per chunk |
| `policy.n_action_steps` | `50` | Number of actions consumed from a sampled chunk |
| `policy.max_state_dim` | `24` | State padding dimension |
| `policy.max_action_dim` | `24` | Action padding dimension |
| `policy.task_field` | `task` | Batch field used as the language prompt |
## References
- [EVO1 repository](https://github.com/MINT-SJTU/Evo-1)
- [InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B)
## License
This LeRobot integration follows the Apache 2.0 License used by LeRobot. Check the upstream EVO1 and InternVL3 model pages for the licenses of released checkpoints and data.