# EVO1

EVO1 is a Vision-Language-Action policy for robot control built around an InternVL3 backbone and a continuous flow-matching action head. This LeRobot integration exposes EVO1 as a standard policy type so it can be trained and evaluated with the usual LeRobot dataset, checkpoint, and processor APIs.

## Model Overview

The policy embeds one or more camera images and the language task prompt with InternVL3, pads robot state/action vectors to fixed maximum dimensions, and predicts future action chunks with a flow-matching action head. During inference, the policy samples an action chunk and returns `n_action_steps` actions from that chunk before sampling again.

### What the LeRobot Integration Covers

- Standard `policy.type=evo1` configuration through LeRobot
- InternVL3 image/text embedding with optional FlashAttention fallback
- Stage-based finetuning controls for action-head-only and VLM finetuning runs
- Continuous flow-matching action prediction
- Checkpoint save/load through LeRobot policy APIs
- Training with `lerobot-train` and evaluation with standard policy inference APIs

The broader EVO1 project may include additional training scripts and dataset tooling. This page focuses on the LeRobot robot-control policy path.

## Installation Requirements

1. Install LeRobot by following the [Installation Guide](./installation).
2. Install EVO1 dependencies:

   ```bash
   pip install -e ".[evo1]"
   ```

3. Install a `flash-attn` wheel only if it is compatible with your Python, PyTorch, CUDA, and GPU stack. EVO1 falls back to standard attention when `flash_attn` is not available.

EVO1 uses InternVL3 through the Hugging Face `transformers` remote-code path, so the first run may download the configured VLM checkpoint unless `policy.vlm_model_name` points to a local model directory.
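The padding and cropping behavior described in the overview can be sketched as follows. This is a minimal illustration, not the integration's actual code: the helper names `pad_vector` and `crop_actions` are hypothetical, and only the shape convention (pad inputs up to the configured maximum, crop predictions back to the dataset's dimension) mirrors what the policy does.

```python
import torch

# Hedged sketch (illustrative names, not EVO1's API):
# - inputs are right-padded with zeros up to policy.max_state_dim /
#   policy.max_action_dim so the model always sees fixed-size vectors
# - predicted chunks are cropped back to the dataset action dimension


def pad_vector(vec: torch.Tensor, max_dim: int) -> torch.Tensor:
    """Right-pad the last dimension with zeros up to max_dim."""
    pad = max_dim - vec.shape[-1]
    return torch.nn.functional.pad(vec, (0, pad))


def crop_actions(pred: torch.Tensor, action_dim: int) -> torch.Tensor:
    """Crop padded predictions back to the dataset's action dimension."""
    return pred[..., :action_dim]


state = torch.randn(1, 7)          # e.g. a 7-DoF robot state
padded = pad_vector(state, 24)     # padded to max_state_dim=24
chunk = torch.randn(1, 50, 24)     # a chunk of 50 padded actions
actions = crop_actions(chunk, 7)   # cropped back to the real action dim
```

With `chunk_size=50` and `n_action_steps=50`, every sampled chunk is consumed in full before the policy samples again.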
## Data Requirements

EVO1 expects a LeRobot dataset with:

- One to `policy.max_views` visual observations, for example `observation.images.image`
- `observation.state`
- `action`
- A language task instruction in the dataset `task` field, or another field configured with `policy.task_field`

State and action vectors are padded to `policy.max_state_dim` and `policy.max_action_dim`. Predictions are cropped back to the dataset action dimension before being returned.

## Usage

To use EVO1 in a LeRobot configuration, specify:

```bash
policy.type=evo1
```

By default, a new EVO1 policy initializes its VLM from:

```bash
policy.vlm_model_name=OpenGVLab/InternVL3-1B
```

Once a LeRobot-format EVO1 checkpoint is available, load it with:

```bash
policy.path=your-org/your-evo1-checkpoint
```

## Training

### Stage 1

Stage 1 freezes the VLM and trains the action head:

```bash
lerobot-train \
  --dataset.repo_id=your_org/your_dataset \
  --policy.type=evo1 \
  --policy.training_stage=stage1 \
  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
  --policy.device=cuda \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=24 \
  --policy.max_action_dim=24 \
  --policy.optimizer_lr=1e-5 \
  --batch_size=4 \
  --steps=5000 \
  --output_dir=./outputs/evo1_stage1
```

### Stage 2

Stage 2 finetunes the VLM branches and action head.
A common workflow starts from a Stage 1 checkpoint:

```bash
lerobot-train \
  --dataset.repo_id=your_org/your_dataset \
  --policy.path=./outputs/evo1_stage1/checkpoints/005000/pretrained_model \
  --policy.training_stage=stage2 \
  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
  --policy.device=cuda \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=24 \
  --policy.max_action_dim=24 \
  --policy.optimizer_lr=1e-5 \
  --batch_size=4 \
  --steps=80000 \
  --output_dir=./outputs/evo1_stage2
```

### Key Training Parameters

| Parameter                                     | Default                  | Description                                                       |
| --------------------------------------------- | ------------------------ | ----------------------------------------------------------------- |
| `policy.vlm_model_name`                       | `OpenGVLab/InternVL3-1B` | InternVL3 checkpoint or local model directory                     |
| `policy.training_stage`                       | `stage1`                 | `stage1` trains the action head; `stage2` finetunes VLM branches  |
| `policy.vlm_num_layers`                       | `14`                     | Number of InternVL3 language layers kept for the policy           |
| `policy.vlm_dtype`                            | `bfloat16`               | Requested VLM dtype                                               |
| `policy.use_flash_attn`                       | `true`                   | Requests FlashAttention when installed; otherwise falls back      |
| `policy.enable_gradient_checkpointing`        | `true`                   | Enables checkpointing on supported InternVL3 modules              |
| `policy.gradient_checkpointing_use_reentrant` | `false`                  | Reentrant setting passed to gradient checkpointing when supported |
| `policy.chunk_size`                           | `50`                     | Number of future actions predicted per chunk                      |
| `policy.n_action_steps`                       | `50`                     | Number of actions consumed from a sampled chunk                   |
| `policy.max_state_dim`                        | `24`                     | State padding dimension                                           |
| `policy.max_action_dim`                       | `24`                     | Action padding dimension                                          |
| `policy.task_field`                           | `task`                   | Batch field used as the language prompt                           |

## References

- [EVO1 repository](https://github.com/MINT-SJTU/Evo-1)
- [InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B)

## License

This LeRobot integration follows the Apache 2.0 License used by LeRobot.
Check the upstream EVO1 and InternVL3 model pages for the licenses of released checkpoints and data.