Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-15 00:29:52 +00:00)

Compare commits (20 commits)
| SHA1 |
|---|
| 63fad12e8d |
| 2545f1a8ed |
| 5f85b572d7 |
| 72692525da |
| 9a298524ca |
| 002a9dd0b9 |
| e40985b013 |
| d03200bdb3 |
| ac41cd6672 |
| 9b211a45d6 |
| a6387da464 |
| 0328b3f4aa |
| cc8e4c0d86 |
| 819c1b9710 |
| f0848c6887 |
| 19c6adef85 |
| 96b7f3dae0 |
| 885ef91892 |
| b0efa73520 |
| 00b662de02 |
@@ -165,7 +165,7 @@ hf auth login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
|
||||
Then store your Hugging Face username in a variable:
|
||||
|
||||
```bash
|
||||
HF_USER=$(hf auth whoami | awk -F': *' 'NR==1 {print $2}')
|
||||
HF_USER=$(NO_COLOR=1 hf auth whoami | awk -F': *' 'NR==1 {print $2}')
|
||||
echo $HF_USER
|
||||
```
|
||||
|
||||
|
||||
+84
-43
@@ -12,36 +12,59 @@ The Unitree G1 humanoid is now supported in LeRobot! You can teleoperate, train
|
||||
|
||||
## Part 1: Getting Started
|
||||
|
||||
### Install LeRobot on Your Machine
|
||||
### Install the Unitree SDK
|
||||
|
||||
Follow the [unitree_sdk2_python installation guide](https://github.com/unitreerobotics/unitree_sdk2_python#installation). Tested with `unitree_sdk2py==1.0.1` and `cyclonedds==0.10.2`:
|
||||
|
||||
```bash
|
||||
conda create -y -n lerobot python=3.12
|
||||
conda activate lerobot
|
||||
git clone https://github.com/unitreerobotics/unitree_sdk2_python.git
|
||||
cd unitree_sdk2_python && pip install -e .
|
||||
cd unitree_sdk2_python
|
||||
pip install -e .
|
||||
cd ..
|
||||
```
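A quick way to confirm that the versions mentioned above actually landed in the environment (a sketch; it assumes the `lerobot` env is still active and that the SDK's distribution and import name is `unitree_sdk2py`):

```bash
# Show the installed SDK and DDS versions, then make sure both import cleanly
pip show unitree_sdk2py cyclonedds | grep -E "^(Name|Version)"
python -c "import unitree_sdk2py, cyclonedds; print('Unitree SDK import OK')"
```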
|
||||
|
||||
### Install LeRobot
|
||||
|
||||
```bash
|
||||
conda install ffmpeg -c conda-forge
|
||||
conda install -c conda-forge "pinocchio>=3.0.0,<4.0.0"
|
||||
git clone https://github.com/huggingface/lerobot.git
|
||||
cd lerobot
|
||||
pip install -e '.[unitree_g1]'
|
||||
```
|
||||
|
||||
<Tip>
|
||||
For now, pinocchio must be installed from conda-forge (not pip) to include the
|
||||
CasADi bindings needed for arm IK.
|
||||
</Tip>
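A minimal sanity check that the conda-forge build actually exposes the CasADi bindings (assumption: they live in the `pinocchio.casadi` submodule; an ImportError here usually means a pip wheel was picked up instead):

```bash
# Fails with ImportError if the pinocchio build lacks CasADi support
python -c "import pinocchio, pinocchio.casadi; print('pinocchio', pinocchio.__version__, 'with CasADi bindings')"
```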
|
||||
|
||||
### Test the Installation (Simulation)
|
||||
|
||||
The simulation environment has its own dependencies; see the [Unitree G1 Mujoco EnvHub](https://huggingface.co/lerobot/unitree-g1-mujoco/tree/main) for the full list.
|
||||
|
||||
```bash
|
||||
pip install mujoco loguru msgpack msgpack-numpy
|
||||
```
|
||||
|
||||
```bash
|
||||
lerobot-teleoperate \
|
||||
--robot.type=unitree_g1 \
|
||||
--robot.is_simulation=true \
|
||||
--teleop.type=unitree_g1 \
|
||||
--teleop.id=wbc_unitree \
|
||||
--robot.cameras='{"global_view": {"type": "zmq", "server_address": "localhost", "port": 5555, "camera_name": "head_camera", "width": 640, "height": 480, "fps": 30}}' \
|
||||
--display_data=true
|
||||
--robot.cameras='{"global_view": {"type": "zmq", "server_address": "localhost", "port": 5555, "camera_name": "head_camera", "width": 640, "height": 480, "fps": 30, "warmup_s": 5}}' \
|
||||
--display_data=true \
|
||||
--robot.controller=GrootLocomotionController
|
||||
```
|
||||
|
||||
This will launch a [MuJoCo sim instance](https://huggingface.co/lerobot/unitree-g1-mujoco/tree/main) for the G1.
|
||||
This will launch a [MuJoCo sim instance](https://huggingface.co/lerobot/unitree-g1-mujoco/tree/main) for the G1. You can connect a gamepad to your machine before launching to control the robot's locomotion in sim. We support both [HolosomaLocomotionController](https://github.com/amazon-far/holosoma) and [GrootLocomotionController](https://github.com/NVlabs/GR00T-WholeBodyControl) via `--robot.controller`.
|
||||
|
||||
- Press `9` to release the robot
|
||||
- Press `7` / `8` to increase / decrease waist height
|
||||
|
||||
### Connect to the Robot
|
||||
### Connect to the Physical Robot
|
||||
|
||||
The G1's Ethernet IP is fixed at `192.168.123.164`. Your machine must have a static IP on the same subnet: `192.168.123.x` where `x ≠ 164`.
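One way to configure such a static IP with NetworkManager is sketched below; `enp131s0` is a placeholder interface name (it happens to match the Ethernet interface used in the examples further down) and `.51` is an arbitrary free host address:

```bash
# Give your laptop a fixed address on the robot's subnet, then verify the link
sudo nmcli con add type ethernet ifname enp131s0 con-name g1-link \
  ipv4.method manual ipv4.addresses 192.168.123.51/24
sudo nmcli con up g1-link
ping -c 3 192.168.123.164   # the robot should reply
```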
|
||||
|
||||
@@ -59,37 +82,11 @@ ssh unitree@192.168.123.164
|
||||
# Password: 123
|
||||
```
|
||||
|
||||
### Install LeRobot on the G1
|
||||
### Share Internet via Ethernet
|
||||
|
||||
From the robot:
|
||||
The G1 needs internet access to clone repos and install packages. Share your laptop's connection over Ethernet:
|
||||
|
||||
```bash
|
||||
conda create -y -n lerobot python=3.12
|
||||
conda activate lerobot
|
||||
git clone https://github.com/unitreerobotics/unitree_sdk2_python.git
|
||||
cd unitree_sdk2_python && pip install -e .
|
||||
git clone https://github.com/huggingface/lerobot.git
|
||||
cd lerobot
|
||||
pip install -e '.[unitree_g1]'
|
||||
```
|
||||
|
||||
> **Note:** The Unitree SDK requires CycloneDDS v0.10.2. See the [Unitree SDK docs](https://github.com/unitreerobotics/unitree_sdk2_python) for details.
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Enable WiFi on the Robot
|
||||
|
||||
Wi-Fi is blocked by default on the G1. To enable it:
|
||||
|
||||
```bash
|
||||
sudo rfkill unblock all
|
||||
sudo ip link set wlan0 up
|
||||
sudo nmcli radio wifi on
|
||||
sudo nmcli device set wlan0 managed yes
|
||||
sudo systemctl restart NetworkManager
|
||||
```
|
||||
|
||||
**On your laptop** (share internet via Ethernet):
|
||||
**On your laptop:**
|
||||
|
||||
```bash
|
||||
sudo sysctl -w net.ipv4.ip_forward=1
|
||||
@@ -100,7 +97,7 @@ sudo iptables -A FORWARD -i wlp132s0f0 -o enp131s0 -m state --state RELATED,ESTA
|
||||
sudo iptables -A FORWARD -i enp131s0 -o wlp132s0f0 -j ACCEPT
|
||||
```
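To confirm that forwarding is actually in effect (interface names as in the example above; any NAT/masquerade rules are outside this hunk):

```bash
# Forwarding must be enabled and the two ACCEPT rules present
sysctl net.ipv4.ip_forward    # should print net.ipv4.ip_forward = 1
sudo iptables -S FORWARD      # should list the ACCEPT rules added above
```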
|
||||
|
||||
**On the G1** (set default route through your laptop):
|
||||
**On the G1:**
|
||||
|
||||
```bash
|
||||
sudo ip route del default 2>/dev/null || true
|
||||
@@ -111,6 +108,45 @@ echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
|
||||
ping -c 3 8.8.8.8
|
||||
```
|
||||
|
||||
### Install the Unitree SDK on the G1
|
||||
|
||||
Follow the [unitree_sdk2_python installation guide](https://github.com/unitreerobotics/unitree_sdk2_python#installation):
|
||||
|
||||
```bash
|
||||
conda create -y -n lerobot python=3.12
|
||||
conda activate lerobot
|
||||
git clone https://github.com/unitreerobotics/unitree_sdk2_python.git
|
||||
cd unitree_sdk2_python
|
||||
python -m pip install -e .
|
||||
cd ..
|
||||
```
|
||||
|
||||
### Install LeRobot on the G1
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/lerobot.git
|
||||
cd lerobot
|
||||
conda install -c conda-forge "pinocchio>=3.0.0,<4.0.0"
|
||||
python -m pip install -e '.[unitree_g1]'
|
||||
```
|
||||
|
||||
<Tip>
|
||||
For now, pinocchio must be installed from conda-forge (not pip) to include the
|
||||
CasADi bindings needed for arm IK.
|
||||
</Tip>
|
||||
|
||||
### (Optional) Enable WiFi on the Robot
|
||||
|
||||
For wireless SSH access, you can enable WiFi on the G1 (it's blocked by default):
|
||||
|
||||
```bash
|
||||
sudo rfkill unblock all
|
||||
sudo ip link set wlan0 up
|
||||
sudo nmcli radio wifi on
|
||||
sudo nmcli device set wlan0 managed yes
|
||||
sudo systemctl restart NetworkManager
|
||||
```
|
||||
|
||||
**Connect to a WiFi network:**
|
||||
|
||||
```bash
|
||||
@@ -125,7 +161,7 @@ sudo nmcli connection up "YourNetwork"
|
||||
ip a show wlan0
|
||||
```
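If the network has not been configured on the robot yet, a connection can usually be created in one step; a sketch with placeholder SSID and password:

```bash
# Creates and activates a WiFi connection on wlan0 in a single command
sudo nmcli device wifi connect "YourNetwork" password "YourPassword" ifname wlan0
```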
|
||||
|
||||
You can now SSH over WiFi:
|
||||
You can then SSH over WiFi instead of Ethernet:
|
||||
|
||||
```bash
|
||||
ssh unitree@<ROBOT_WIFI_IP>
|
||||
@@ -134,18 +170,23 @@ ssh unitree@<ROBOT_WIFI_IP>
|
||||
|
||||
---
|
||||
|
||||
## Part 3: Teleoperation & Locomotion
|
||||
## Part 2: Teleoperation & Locomotion
|
||||
|
||||
### Run the Robot Server
|
||||
|
||||
On the robot:
|
||||
On the robot (from `~/lerobot`):
|
||||
|
||||
```bash
|
||||
cd ~/lerobot
|
||||
python src/lerobot/robots/unitree_g1/run_g1_server.py --camera
|
||||
```
|
||||
|
||||
### Run the Locomotion Policy
|
||||
|
||||
You can run the teleoperation client from your laptop over Ethernet, over WiFi (experimental), or directly on the robot itself. Be mindful of any latency your network introduces.
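Before launching the client from another machine, it can help to confirm the robot is reachable and that the camera stream port is open; a sketch where `5555` is simply the port used in the camera examples in this guide (it may differ in your setup) and `nc` from netcat is assumed to be installed:

```bash
ping -c 3 192.168.123.164        # basic reachability over the Ethernet link
nc -zv 192.168.123.164 5555      # check the ZMQ camera port is accepting connections
```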
|
||||
|
||||
**From your laptop:**
|
||||
|
||||
```bash
|
||||
lerobot-teleoperate \
|
||||
--robot.type=unitree_g1 \
|
||||
@@ -158,13 +199,13 @@ lerobot-teleoperate \
|
||||
--robot.controller=HolosomaLocomotionController
|
||||
```
|
||||
|
||||
We support both [HolosomaLocomotionController](https://github.com/amazon-far/holosoma) and [GrootLocomotionController](https://github.com/NVlabs/GR00T-WholeBodyControl).
|
||||
We support both [GrootLocomotionController](https://github.com/NVlabs/GR00T-WholeBodyControl) and [HolosomaLocomotionController](https://github.com/amazon-far/holosoma) via `--robot.controller`.
|
||||
|
||||
---
|
||||
|
||||
## Part 4: Loco-Manipulation with the Homunculus Exoskeleton
|
||||
## Part 3: Loco-Manipulation with the Homunculus Exoskeleton
|
||||
|
||||
We provide a loco-manipulation solution via the Homunculus Exoskeleton — an open-source 7 DoF exoskeleton for whole-body control. Assembly instructions [here](https://github.com/nepyope/hmc_exo).
|
||||
We provide a loco-manipulation solution via the Homunculus Exoskeleton — an open-source 7 DoF exoskeleton for whole-body control. Check it out [here](https://github.com/nepyope/hmc_exo).
|
||||
|
||||
### Calibrate
|
||||
|
||||
@@ -205,7 +246,7 @@ Example dataset: [nepyope/unitree_box_move_blue_full](https://huggingface.co/dat
|
||||
|
||||
---
|
||||
|
||||
## Part 5: Training & Inference
|
||||
## Part 4: Training & Inference
|
||||
|
||||
### Train
|
||||
|
||||
|
||||
+6
-6
@@ -25,7 +25,7 @@ discord = "https://discord.gg/s3KuuzsPFb"
|
||||
|
||||
[project]
|
||||
name = "lerobot"
|
||||
version = "0.4.5"
|
||||
version = "0.5.1"
|
||||
description = "🤗 LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch"
|
||||
dynamic = ["readme"]
|
||||
license = { text = "Apache-2.0" }
|
||||
@@ -76,7 +76,7 @@ dependencies = [
|
||||
"torchvision>=0.21.0,<0.26.0",
|
||||
|
||||
"einops>=0.8.0,<0.9.0",
|
||||
"opencv-python-headless>=4.9.0,<4.13.0",
|
||||
"opencv-python-headless>=4.9.0,<4.14.0",
|
||||
"av>=15.0.0,<16.0.0",
|
||||
"jsonlines>=4.0.0,<5.0.0",
|
||||
"pynput>=1.7.8,<1.9.0",
|
||||
@@ -100,7 +100,7 @@ dependencies = [
|
||||
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
|
||||
placo-dep = ["placo>=0.9.6,<0.9.17"]
|
||||
transformers-dep = ["transformers>=5.3.0,<6.0.0"]
|
||||
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.34.0"]
|
||||
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
|
||||
can-dep = ["python-can>=4.2.0,<5.0.0"]
|
||||
peft-dep = ["peft>=0.18.0,<1.0.0"]
|
||||
scipy-dep = ["scipy>=1.14.0,<2.0.0"]
|
||||
@@ -119,14 +119,13 @@ gamepad = ["lerobot[pygame-dep]", "hidapi>=0.14.0,<0.15.0"]
|
||||
hopejr = ["lerobot[feetech]", "lerobot[pygame-dep]"]
|
||||
lekiwi = ["lerobot[feetech]", "pyzmq>=26.2.1,<28.0.0"]
|
||||
unitree_g1 = [
|
||||
"unitree-sdk2==1.0.1",
|
||||
# "unitree-sdk2==1.0.1",
|
||||
"pyzmq>=26.2.1,<28.0.0",
|
||||
"onnxruntime>=1.16.0,<2.0.0",
|
||||
"pin>=3.0.0,<4.0.0",
|
||||
"onnx>=1.16.0,<2.0.0",
|
||||
"meshcat>=0.3.0,<0.4.0",
|
||||
"lerobot[matplotlib-dep]",
|
||||
"lerobot[pygame-dep]",
|
||||
"casadi>=3.6.0,<4.0.0",
|
||||
]
|
||||
reachy2 = ["reachy2_sdk>=1.0.15,<1.1.0"]
|
||||
kinematics = ["lerobot[placo-dep]"]
|
||||
@@ -223,6 +222,7 @@ lerobot-eval="lerobot.scripts.lerobot_eval:main"
|
||||
lerobot-train="lerobot.scripts.lerobot_train:main"
|
||||
lerobot-train-tokenizer="lerobot.scripts.lerobot_train_tokenizer:main"
|
||||
lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
|
||||
lerobot-dataset-subtask-annotate="lerobot.scripts.lerobot_subtask_annotate:main"
|
||||
lerobot-info="lerobot.scripts.lerobot_info:main"
|
||||
lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
|
||||
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||
# Data annotations for subtasks and VLM-based labeling.
|
||||
@@ -0,0 +1,671 @@
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import json
|
||||
import subprocess
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
import cv2
|
||||
|
||||
from lerobot.datasets.dataset_tools import add_features
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
from lerobot.datasets.utils import (
|
||||
create_subtask_index_array,
|
||||
create_subtasks_dataframe,
|
||||
save_subtasks,
|
||||
)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from lerobot.data_processing.data_annotations.vlm_annotations import BaseVLM
|
||||
|
||||
|
||||
# Skill Annotation Data Structures
|
||||
class Skill:
|
||||
"""Represents a single atomic skill/subtask in a demonstration."""
|
||||
|
||||
def __init__(self, name: str, start: float, end: float):
|
||||
self.name = name
|
||||
self.start = start # Start timestamp in seconds
|
||||
self.end = end # End timestamp in seconds
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {"name": self.name, "start": self.start, "end": self.end}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "Skill":
|
||||
return cls(name=data["name"], start=data["start"], end=data["end"])
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"Skill(name='{self.name}', start={self.start:.2f}, end={self.end:.2f})"
|
||||
|
||||
|
||||
class EpisodeSkills:
|
||||
"""Container for all skills in an episode."""
|
||||
|
||||
def __init__(self, episode_index: int, description: str, skills: list[Skill]):
|
||||
self.episode_index = episode_index
|
||||
self.description = description
|
||||
self.skills = skills
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"episode_index": self.episode_index,
|
||||
"description": self.description,
|
||||
"skills": [s.to_dict() for s in self.skills],
|
||||
}
|
||||
|
||||
|
||||
# Video Extraction Utilities
|
||||
|
||||
|
||||
class VideoExtractor:
|
||||
"""Utilities for extracting and processing video segments from LeRobot datasets."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
pass
|
||||
|
||||
def extract_episode_video(
|
||||
self,
|
||||
video_path: Path,
|
||||
start_timestamp: float,
|
||||
end_timestamp: float,
|
||||
target_fps: int = 1,
|
||||
) -> Path:
|
||||
"""
|
||||
Extract a specific episode segment from a concatenated video file.
|
||||
|
||||
Args:
|
||||
video_path: Path to the source video file
|
||||
start_timestamp: Start time in seconds
|
||||
end_timestamp: End time in seconds
|
||||
target_fps: Target frames per second for output
|
||||
|
||||
Returns:
|
||||
Path to the extracted temporary video file
|
||||
"""
|
||||
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
|
||||
tmp_path = Path(tmp_file.name)
|
||||
|
||||
duration = end_timestamp - start_timestamp
|
||||
|
||||
print(f"Extracting: {start_timestamp:.1f}s - {end_timestamp:.1f}s ({duration:.1f}s)")
|
||||
|
||||
cmd = [
|
||||
"ffmpeg",
|
||||
"-i",
|
||||
str(video_path),
|
||||
"-ss",
|
||||
str(start_timestamp),
|
||||
"-t",
|
||||
str(duration),
|
||||
"-r",
|
||||
str(target_fps),
|
||||
"-c:v",
|
||||
"libx264",
|
||||
"-preset",
|
||||
"ultrafast",
|
||||
"-crf",
|
||||
"23",
|
||||
"-an",
|
||||
"-y",
|
||||
str(tmp_path),
|
||||
]
|
||||
|
||||
try:
|
||||
subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
|
||||
except subprocess.CalledProcessError as e:
|
||||
raise RuntimeError(f"FFmpeg failed: {e}") from e
|
||||
except FileNotFoundError as e:
|
||||
raise RuntimeError("FFmpeg not found. Please install ffmpeg.") from e
|
||||
|
||||
if not tmp_path.exists() or tmp_path.stat().st_size < 1024:
|
||||
if tmp_path.exists():
|
||||
tmp_path.unlink()
|
||||
raise RuntimeError("Video extraction produced invalid file")
|
||||
|
||||
return tmp_path
|
||||
|
||||
def add_timer_overlay(self, video_path: Path) -> Path:
|
||||
"""
|
||||
Add a visible timer overlay to each frame (elapsed time in seconds) in one corner.
|
||||
Used so the VLM can read the timestamp from the image instead of relying on file metadata.
|
||||
Draws a black box with white text at top-right. Writes to a new temporary file and returns its path.
|
||||
"""
|
||||
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as out_file:
|
||||
out_path = Path(out_file.name)
|
||||
|
||||
cap = cv2.VideoCapture(str(video_path))
|
||||
if not cap.isOpened():
|
||||
raise RuntimeError("Failed to open video")
|
||||
|
||||
fps = cap.get(cv2.CAP_PROP_FPS) or 1.0
|
||||
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
|
||||
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
|
||||
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
|
||||
writer = cv2.VideoWriter(str(out_path), fourcc, fps, (w, h))
|
||||
|
||||
font = cv2.FONT_HERSHEY_SIMPLEX
|
||||
font_scale = max(1.2, min(h, w) / 350.0)
|
||||
thickness = max(2, int(font_scale))
|
||||
|
||||
padding = 15
|
||||
margin = 30
|
||||
|
||||
frame_idx = 0
|
||||
while True:
|
||||
ret, frame = cap.read()
|
||||
if not ret:
|
||||
break
|
||||
|
||||
t_sec = frame_idx / fps
|
||||
text = f"{t_sec:.2f} s"
|
||||
|
||||
(tw, th), baseline = cv2.getTextSize(text, font, font_scale, thickness)
|
||||
|
||||
# Top-right placement
|
||||
x_text = w - tw - margin - padding
|
||||
y_text = margin + th + padding
|
||||
|
||||
# Rectangle coordinates (black box behind text)
|
||||
x1 = x_text - padding
|
||||
y1 = y_text - th - padding
|
||||
x2 = x_text + tw + padding
|
||||
y2 = y_text + baseline + padding
|
||||
|
||||
# Draw black filled rectangle
|
||||
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)
|
||||
|
||||
# Draw white text
|
||||
cv2.putText(
|
||||
frame,
|
||||
text,
|
||||
(x_text, y_text),
|
||||
font,
|
||||
font_scale,
|
||||
(255, 255, 255),
|
||||
thickness,
|
||||
lineType=cv2.LINE_AA,
|
||||
)
|
||||
|
||||
writer.write(frame)
|
||||
frame_idx += 1
|
||||
|
||||
cap.release()
|
||||
writer.release()
|
||||
if not out_path.exists() or out_path.stat().st_size < 1024:
|
||||
if out_path.exists():
|
||||
out_path.unlink()
|
||||
raise RuntimeError("Timer overlay produced invalid file")
|
||||
return out_path
|
||||
|
||||
def get_video_duration(self, video_path: Path) -> float:
|
||||
"""Get duration of a video file in seconds."""
|
||||
cap = cv2.VideoCapture(str(video_path))
|
||||
fps = cap.get(cv2.CAP_PROP_FPS) or 30
|
||||
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
|
||||
cap.release()
|
||||
return frame_count / fps
|
||||
|
||||
|
||||
# Skill Annotation Pipeline
|
||||
class SkillAnnotator:
|
||||
"""
|
||||
Main class for annotating LeRobot datasets with skill labels.
|
||||
|
||||
This class orchestrates the full annotation pipeline:
|
||||
1. Load dataset
|
||||
2. Extract video segments for each episode
|
||||
3. Run VLM-based skill segmentation
|
||||
4. Update dataset task metadata
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vlm: "BaseVLM",
|
||||
video_extractor: VideoExtractor | None = None,
|
||||
batch_size: int = 8,
|
||||
add_timer_overlay: bool = True,
|
||||
):
|
||||
self.vlm = vlm
|
||||
self.video_extractor = video_extractor or VideoExtractor()
|
||||
self.batch_size = batch_size
|
||||
self.add_timer_overlay = add_timer_overlay
|
||||
|
||||
def annotate_dataset(
|
||||
self,
|
||||
dataset: LeRobotDataset,
|
||||
video_key: str,
|
||||
episodes: list[int] | None = None,
|
||||
skip_existing: bool = False,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> dict[int, EpisodeSkills]:
|
||||
"""
|
||||
Annotate all episodes in a dataset with skill labels using batched processing.
|
||||
|
||||
Args:
|
||||
dataset: LeRobot dataset to annotate
|
||||
video_key: Key for video observations (e.g., "observation.images.base")
|
||||
episodes: Specific episode indices to annotate (None = all)
|
||||
skip_existing: Skip episodes that already have skill annotations
|
||||
subtask_labels: If provided, model must choose only from these labels (closed vocabulary)
|
||||
|
||||
Returns:
|
||||
Dictionary mapping episode index to EpisodeSkills
|
||||
"""
|
||||
episode_indices = episodes or list(range(dataset.meta.total_episodes))
|
||||
annotations: dict[int, EpisodeSkills] = {}
|
||||
failed_episodes: dict[int, str] = {} # Track failed episodes with error messages
|
||||
|
||||
# Get coarse task description if available
|
||||
coarse_goal = self._get_coarse_goal(dataset)
|
||||
|
||||
# Filter out episodes that already have annotations if skip_existing is True
|
||||
if skip_existing:
|
||||
existing_annotations = load_skill_annotations(dataset.root)
|
||||
if existing_annotations and "episodes" in existing_annotations:
|
||||
# Only skip episodes that exist AND have non-empty skills
|
||||
existing_episode_indices = set()
|
||||
for idx_str, episode_data in existing_annotations["episodes"].items():
|
||||
idx = int(idx_str)
|
||||
# Check if skills list exists and is not empty
|
||||
if "skills" in episode_data and episode_data["skills"]:
|
||||
existing_episode_indices.add(idx)
|
||||
|
||||
original_count = len(episode_indices)
|
||||
episode_indices = [ep for ep in episode_indices if ep not in existing_episode_indices]
|
||||
skipped_count = original_count - len(episode_indices)
|
||||
if skipped_count > 0:
|
||||
print(f"Skipping {skipped_count} episodes with existing non-empty annotations")
|
||||
|
||||
if not episode_indices:
|
||||
print("No episodes to annotate (all already annotated)")
|
||||
return annotations
|
||||
|
||||
print(f"Annotating {len(episode_indices)} episodes in batches of {self.batch_size}...")
|
||||
|
||||
# Process episodes in batches
|
||||
for batch_start in range(0, len(episode_indices), self.batch_size):
|
||||
batch_end = min(batch_start + self.batch_size, len(episode_indices))
|
||||
batch_episodes = episode_indices[batch_start:batch_end]
|
||||
|
||||
print(
|
||||
f"Processing batch {batch_start // self.batch_size + 1}/{(len(episode_indices) + self.batch_size - 1) // self.batch_size} (episodes {batch_episodes[0]} to {batch_episodes[-1]})..."
|
||||
)
|
||||
|
||||
try:
|
||||
batch_annotations = self._annotate_episodes_batch(
|
||||
dataset, batch_episodes, video_key, coarse_goal, subtask_labels
|
||||
)
|
||||
|
||||
for ep_idx in batch_episodes:
|
||||
if ep_idx in batch_annotations and batch_annotations[ep_idx]:
|
||||
skills = batch_annotations[ep_idx]
|
||||
annotations[ep_idx] = EpisodeSkills(
|
||||
episode_index=ep_idx,
|
||||
description=coarse_goal,
|
||||
skills=skills,
|
||||
)
|
||||
print(f" Episode {ep_idx}: {len(skills)} skills identified")
|
||||
else:
|
||||
failed_episodes[ep_idx] = "Empty or missing skills from batch processing"
|
||||
print(f"⚠ Episode {ep_idx}: No skills extracted, will retry")
|
||||
except Exception as e:
|
||||
print(f"✗ Batch failed: {e}. Falling back to single-episode processing...")
|
||||
# Fallback: process episodes one by one
|
||||
for ep_idx in batch_episodes:
|
||||
try:
|
||||
skills = self._annotate_episode(
|
||||
dataset, ep_idx, video_key, coarse_goal, subtask_labels
|
||||
)
|
||||
if skills:
|
||||
annotations[ep_idx] = EpisodeSkills(
|
||||
episode_index=ep_idx,
|
||||
description=coarse_goal,
|
||||
skills=skills,
|
||||
)
|
||||
print(f" Episode {ep_idx}: {len(skills)} skills identified")
|
||||
else:
|
||||
failed_episodes[ep_idx] = "Empty skills list from single-episode processing"
|
||||
print(f"⚠ Episode {ep_idx}: No skills extracted, will retry")
|
||||
except Exception as ep_error:
|
||||
failed_episodes[ep_idx] = str(ep_error)
|
||||
print(f"⚠ Episode {ep_idx} failed: {ep_error}, will retry")
|
||||
|
||||
# Retry failed episodes one more time
|
||||
if failed_episodes:
|
||||
print(f"\nRetrying {len(failed_episodes)} failed episodes...")
|
||||
retry_count = 0
|
||||
for ep_idx, error_msg in list(failed_episodes.items()):
|
||||
print(f"Retry attempt for episode {ep_idx} (previous error: {error_msg})")
|
||||
try:
|
||||
skills = self._annotate_episode(dataset, ep_idx, video_key, coarse_goal, subtask_labels)
|
||||
if skills:
|
||||
annotations[ep_idx] = EpisodeSkills(
|
||||
episode_index=ep_idx,
|
||||
description=coarse_goal,
|
||||
skills=skills,
|
||||
)
|
||||
print(f" Episode {ep_idx} (retry): {len(skills)} skills identified")
|
||||
del failed_episodes[ep_idx]
|
||||
retry_count += 1
|
||||
else:
|
||||
print(f"✗ Episode {ep_idx} (retry): Still no skills extracted")
|
||||
except Exception as retry_error:
|
||||
failed_episodes[ep_idx] = str(retry_error)
|
||||
print(f"✗ Episode {ep_idx} (retry) failed: {retry_error}")
|
||||
|
||||
if retry_count > 0:
|
||||
print(f"Successfully recovered {retry_count} episodes on retry")
|
||||
|
||||
if failed_episodes:
|
||||
print(f"\n⚠ Warning: {len(failed_episodes)} episodes still failed after retry:")
|
||||
for ep_idx, error_msg in failed_episodes.items():
|
||||
print(f" Episode {ep_idx}: {error_msg}")
|
||||
|
||||
return annotations
|
||||
|
||||
def _get_coarse_goal(self, dataset: LeRobotDataset) -> str:
|
||||
"""Extract or generate the coarse task description."""
|
||||
# Try to get from existing task metadata
|
||||
if dataset.meta.tasks is not None and len(dataset.meta.tasks) > 0:
|
||||
# Get the first task description
|
||||
first_task = dataset.meta.tasks.index[0]
|
||||
if first_task:
|
||||
return str(first_task)
|
||||
|
||||
return "Perform the demonstrated manipulation task."
|
||||
|
||||
def _annotate_episodes_batch(
|
||||
self,
|
||||
dataset: LeRobotDataset,
|
||||
episode_indices: list[int],
|
||||
video_key: str,
|
||||
coarse_goal: str,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> dict[int, list[Skill]]:
|
||||
"""Annotate multiple episodes with skill labels in a batch."""
|
||||
# Extract all videos for this batch
|
||||
extracted_paths = []
|
||||
timer_paths = []
|
||||
paths_for_vlm = []
|
||||
durations = []
|
||||
valid_episode_indices = []
|
||||
|
||||
for ep_idx in episode_indices:
|
||||
try:
|
||||
# Get video path and timestamps
|
||||
video_path = dataset.root / dataset.meta.get_video_file_path(ep_idx, video_key)
|
||||
|
||||
if not video_path.exists():
|
||||
print(f"Warning: Video not found for episode {ep_idx}")
|
||||
continue
|
||||
|
||||
# Get episode timestamps from metadata
|
||||
ep = dataset.meta.episodes[ep_idx]
|
||||
start_ts = float(ep[f"videos/{video_key}/from_timestamp"])
|
||||
end_ts = float(ep[f"videos/{video_key}/to_timestamp"])
|
||||
duration = end_ts - start_ts
|
||||
|
||||
# Extract episode segment to temporary file
|
||||
extracted_path = self.video_extractor.extract_episode_video(
|
||||
video_path, start_ts, end_ts, target_fps=dataset.meta.fps
|
||||
)
|
||||
|
||||
if self.add_timer_overlay:
|
||||
video_for_vlm = self.video_extractor.add_timer_overlay(extracted_path)
|
||||
extracted_paths.append(extracted_path)
|
||||
timer_paths.append(video_for_vlm)
|
||||
else:
|
||||
video_for_vlm = extracted_path
|
||||
extracted_paths.append(extracted_path)
|
||||
timer_paths.append(None)
|
||||
|
||||
paths_for_vlm.append(video_for_vlm)
|
||||
durations.append(duration)
|
||||
valid_episode_indices.append(ep_idx)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Warning: Failed to extract video for episode {ep_idx}: {e}")
|
||||
continue
|
||||
|
||||
if not paths_for_vlm:
|
||||
return {}
|
||||
|
||||
try:
|
||||
# Run VLM skill segmentation in batch
|
||||
all_skills = self.vlm.segment_skills_batch(paths_for_vlm, durations, coarse_goal, subtask_labels)
|
||||
|
||||
# Map results back to episode indices
|
||||
results = {}
|
||||
for ep_idx, skills in zip(valid_episode_indices, all_skills, strict=True):
|
||||
results[ep_idx] = skills
|
||||
|
||||
return results
|
||||
|
||||
finally:
|
||||
# Clean up all temporary files (extracted and timer-overlay)
|
||||
for path in extracted_paths:
|
||||
if path.exists():
|
||||
path.unlink()
|
||||
for path in timer_paths:
|
||||
if path is not None and path.exists():
|
||||
path.unlink()
|
||||
|
||||
def _annotate_episode(
|
||||
self,
|
||||
dataset: LeRobotDataset,
|
||||
episode_index: int,
|
||||
video_key: str,
|
||||
coarse_goal: str,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> list[Skill]:
|
||||
"""Annotate a single episode with skill labels."""
|
||||
# Get video path and timestamps for this episode
|
||||
video_path = dataset.root / dataset.meta.get_video_file_path(episode_index, video_key)
|
||||
|
||||
if not video_path.exists():
|
||||
raise FileNotFoundError(f"Video not found: {video_path}")
|
||||
|
||||
# Get episode timestamps from metadata
|
||||
ep = dataset.meta.episodes[episode_index]
|
||||
start_ts = float(ep[f"videos/{video_key}/from_timestamp"])
|
||||
end_ts = float(ep[f"videos/{video_key}/to_timestamp"])
|
||||
duration = end_ts - start_ts
|
||||
|
||||
# Extract episode segment to temporary file
|
||||
extracted_path = self.video_extractor.extract_episode_video(
|
||||
video_path, start_ts, end_ts, target_fps=1
|
||||
)
|
||||
if self.add_timer_overlay:
|
||||
video_for_vlm = self.video_extractor.add_timer_overlay(extracted_path)
|
||||
else:
|
||||
video_for_vlm = extracted_path
|
||||
|
||||
try:
|
||||
# Run VLM skill segmentation
|
||||
skills = self.vlm.segment_skills(video_for_vlm, duration, coarse_goal, subtask_labels)
|
||||
return skills
|
||||
finally:
|
||||
# Clean up temporary files (extracted and optionally timer-overlay)
|
||||
if extracted_path.exists():
|
||||
extracted_path.unlink()
|
||||
if self.add_timer_overlay and video_for_vlm != extracted_path and video_for_vlm.exists():
|
||||
video_for_vlm.unlink()
|
||||
|
||||
|
||||
# Metadata Writer - Updates per-frame task_index based on skills
|
||||
|
||||
|
||||
def get_skill_for_timestamp(skills: list[Skill], timestamp: float) -> Skill | None:
|
||||
"""
|
||||
Find which skill covers a given timestamp.
|
||||
|
||||
Args:
|
||||
skills: List of skills with start/end times
|
||||
timestamp: Frame timestamp in seconds
|
||||
|
||||
Returns:
|
||||
The Skill that covers this timestamp, or None if not found
|
||||
"""
|
||||
for skill in skills:
|
||||
if skill.start <= timestamp < skill.end:
|
||||
return skill
|
||||
# Handle the last frame (end boundary)
|
||||
if timestamp >= skill.end and skill == skills[-1]:
|
||||
return skill
|
||||
return skills[-1] if skills else None # Fallback to last skill
|
||||
|
||||
|
||||
def save_skill_annotations(
|
||||
dataset: LeRobotDataset,
|
||||
annotations: dict[int, EpisodeSkills],
|
||||
output_dir: Path | None = None,
|
||||
repo_id: str | None = None,
|
||||
) -> LeRobotDataset:
|
||||
"""
|
||||
Save skill annotations to the dataset by:
|
||||
1. Creating a subtasks.parquet file with unique subtasks
|
||||
2. Adding a subtask_index feature to the dataset
|
||||
3. Saving raw skill annotations as JSON for reference
|
||||
|
||||
This function does NOT modify tasks.parquet - it keeps the original tasks intact
|
||||
and creates a separate subtask hierarchy.
|
||||
|
||||
Args:
|
||||
dataset: The LeRobot dataset to annotate
|
||||
annotations: Dictionary of episode skills
|
||||
output_dir: Optional directory to save the modified dataset
|
||||
repo_id: Optional repository ID for the new dataset
|
||||
|
||||
Returns:
|
||||
New dataset with subtask_index feature added
|
||||
"""
|
||||
if not annotations:
|
||||
print("No annotations to save")
|
||||
return dataset
|
||||
|
||||
# Step 1: Create subtasks DataFrame
|
||||
print("Creating subtasks DataFrame...")
|
||||
subtasks_df, skill_to_subtask_idx = create_subtasks_dataframe(annotations)
|
||||
|
||||
# Step 2: Create subtask_index array for all frames
|
||||
print("Creating subtask_index array...")
|
||||
subtask_indices = create_subtask_index_array(dataset, annotations, skill_to_subtask_idx)
|
||||
|
||||
# Step 3: Save subtasks.parquet to the original dataset root
|
||||
save_subtasks(subtasks_df, dataset.root)
|
||||
|
||||
# Step 4: Save the raw skill annotations as JSON for reference
|
||||
skills_path = dataset.root / "meta" / "skills.json"
|
||||
skills_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Load existing skills data if it exists and is not empty
|
||||
existing_skills_data = None
|
||||
if skills_path.exists():
|
||||
try:
|
||||
with open(skills_path) as f:
|
||||
existing_skills_data = json.load(f)
|
||||
if existing_skills_data and len(existing_skills_data.get("episodes", {})) > 0:
|
||||
print(
|
||||
f"Found existing skills.json with {len(existing_skills_data.get('episodes', {}))} episodes, merging..."
|
||||
)
|
||||
except (OSError, json.JSONDecodeError):
|
||||
print("Warning: Could not load existing skills.json, will create new file")
|
||||
existing_skills_data = None
|
||||
|
||||
# Prepare new annotations
|
||||
new_episodes = {str(ep_idx): ann.to_dict() for ep_idx, ann in annotations.items()}
|
||||
|
||||
# Merge with existing data if available
|
||||
if existing_skills_data:
|
||||
# Preserve existing episodes that are not being updated
|
||||
merged_episodes = existing_skills_data.get("episodes", {}).copy()
|
||||
merged_episodes.update(new_episodes)
|
||||
|
||||
# Merge skill_to_subtask_index mappings
|
||||
merged_skill_to_subtask = existing_skills_data.get("skill_to_subtask_index", {}).copy()
|
||||
merged_skill_to_subtask.update(skill_to_subtask_idx)
|
||||
|
||||
# Use existing coarse_description if available, otherwise use new one
|
||||
coarse_desc = existing_skills_data.get(
|
||||
"coarse_description", annotations[next(iter(annotations))].description
|
||||
)
|
||||
|
||||
skills_data = {
|
||||
"coarse_description": coarse_desc,
|
||||
"skill_to_subtask_index": merged_skill_to_subtask,
|
||||
"episodes": merged_episodes,
|
||||
}
|
||||
print(
|
||||
f"Updated {len(new_episodes)} episode(s), total episodes in skills.json: {len(merged_episodes)}"
|
||||
)
|
||||
else:
|
||||
# No existing data, create new
|
||||
skills_data = {
|
||||
"coarse_description": annotations[next(iter(annotations))].description,
|
||||
"skill_to_subtask_index": skill_to_subtask_idx,
|
||||
"episodes": new_episodes,
|
||||
}
|
||||
|
||||
with open(skills_path, "w") as f:
|
||||
json.dump(skills_data, f, indent=2)
|
||||
|
||||
print(f" Saved skill annotations to {skills_path}")
|
||||
|
||||
# Step 5: Add subtask_index feature to dataset using add_features
|
||||
print("Adding subtask_index feature to dataset...")
|
||||
|
||||
# Determine output directory and repo_id
|
||||
output_dir = dataset.root.parent / f"{dataset.root.name}" if output_dir is None else Path(output_dir)
|
||||
|
||||
if repo_id is None:
|
||||
repo_id = f"{dataset.repo_id}"
|
||||
|
||||
# Add feature using dataset_tools
|
||||
feature_info = {
|
||||
"dtype": "int64",
|
||||
"shape": (1,),
|
||||
"names": None,
|
||||
}
|
||||
new_dataset = add_features(
|
||||
dataset=dataset,
|
||||
features={
|
||||
"subtask_index": (subtask_indices, feature_info),
|
||||
},
|
||||
output_dir=output_dir,
|
||||
repo_id=repo_id,
|
||||
)
|
||||
|
||||
# Copy subtasks.parquet to new output directory
|
||||
import shutil
|
||||
|
||||
shutil.copy(dataset.root / "meta" / "subtasks.parquet", output_dir / "meta" / "subtasks.parquet")
|
||||
shutil.copy(dataset.root / "meta" / "skills.json", output_dir / "meta" / "skills.json")
|
||||
|
||||
print(" Successfully added subtask_index feature!")
|
||||
print(f" New dataset saved to: {new_dataset.root}")
|
||||
print(f" Total subtasks: {len(subtasks_df)}")
|
||||
|
||||
return new_dataset
|
||||
|
||||
|
||||
def load_skill_annotations(dataset_root: Path) -> dict | None:
|
||||
"""Load existing skill annotations from a dataset."""
|
||||
skills_path = dataset_root / "meta" / "skills.json"
|
||||
if skills_path.exists():
|
||||
with open(skills_path) as f:
|
||||
return json.load(f)
|
||||
return None
|
||||
@@ -0,0 +1,271 @@
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from abc import ABC, abstractmethod
|
||||
from pathlib import Path
|
||||
|
||||
import torch
|
||||
|
||||
from lerobot.data_processing.data_annotations.subtask_annotations import Skill
|
||||
from lerobot.utils.constants import (
|
||||
SKILL_SEGMENTATION_PROMPT_TEMPLATE,
|
||||
format_subtask_labels_section,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DEFAULT_MODEL = "Qwen/Qwen3.5-27B"
|
||||
|
||||
|
||||
def create_skill_segmentation_prompt(
|
||||
coarse_goal: str | None = None,
|
||||
subtask_labels: list[str] | None = None,
|
||||
duration_seconds: float | None = None,
|
||||
) -> str:
|
||||
"""Create the prompt for skill segmentation using the template from constants."""
|
||||
if duration_seconds is None:
|
||||
raise ValueError("duration_seconds is required for skill segmentation prompt")
|
||||
goal_context = f'The overall goal is: "{coarse_goal}"\n\n' if coarse_goal else ""
|
||||
subtask_labels_section = format_subtask_labels_section(subtask_labels) if subtask_labels else ""
|
||||
video_duration_mm_ss = f"{int(duration_seconds // 60):02d}:{int(duration_seconds % 60):02d}"
|
||||
return SKILL_SEGMENTATION_PROMPT_TEMPLATE.format(
|
||||
goal_context=goal_context,
|
||||
subtask_labels_section=subtask_labels_section,
|
||||
video_duration_seconds=duration_seconds,
|
||||
video_duration_mm_ss=video_duration_mm_ss,
|
||||
)
|
||||
|
||||
|
||||
class BaseVLM(ABC):
|
||||
"""
|
||||
Abstract base class for Vision-Language Models used in skill segmentation.
|
||||
|
||||
To add a new VLM family:
|
||||
1. Subclass BaseVLM
|
||||
2. Implement __init__, segment_skills, and segment_skills_batch
|
||||
3. Register it in get_vlm()
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
def __init__(self, model_name: str, device: str = "cuda", torch_dtype: torch.dtype = torch.bfloat16):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def segment_skills(
|
||||
self,
|
||||
video_path: Path,
|
||||
episode_duration: float,
|
||||
coarse_goal: str | None = None,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> list[Skill]:
|
||||
"""Segment a single video into atomic skills."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def segment_skills_batch(
|
||||
self,
|
||||
video_paths: list[Path],
|
||||
episode_durations: list[float],
|
||||
coarse_goal: str | None = None,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> list[list[Skill]]:
|
||||
"""Segment multiple videos into atomic skills in a single batch."""
|
||||
pass
|
||||
|
||||
def _parse_skills_response(self, response: str) -> list[Skill]:
|
||||
"""Parse JSON skill list from VLM response text."""
|
||||
if "```json" in response:
|
||||
response = response.split("```json")[1].split("```")[0]
|
||||
elif "```" in response:
|
||||
response = response.split("```")[1].split("```")[0]
|
||||
|
||||
try:
|
||||
data = json.loads(response)
|
||||
skills_data = data.get("skills", data)
|
||||
if isinstance(skills_data, list):
|
||||
return [Skill.from_dict(s) for s in skills_data]
|
||||
except json.JSONDecodeError:
|
||||
match = re.search(r"\{.*\}", response, re.DOTALL)
|
||||
if match:
|
||||
try:
|
||||
data = json.loads(match.group())
|
||||
skills_data = data.get("skills", [])
|
||||
return [Skill.from_dict(s) for s in skills_data]
|
||||
except json.JSONDecodeError as e:
|
||||
raise ValueError(f"Could not parse JSON from VLM response: {response[:200]}...") from e
|
||||
|
||||
raise ValueError(f"Could not parse skills from response: {response[:200]}...")
|
||||
|
||||
|
||||
class QwenVL(BaseVLM):
|
||||
"""Qwen VL model for skill segmentation (default: Qwen3.5 series).
|
||||
|
||||
Uses qwen-vl-utils for video processing and the HuggingFace transformers
|
||||
Qwen3VLProcessor pipeline. Requires transformers >= 5.4.0 for correct
|
||||
video position embeddings.
|
||||
"""
|
||||
|
||||
def __init__(self, model_name: str, device: str = "cuda", torch_dtype: torch.dtype = torch.bfloat16):
|
||||
from qwen_vl_utils import process_vision_info
|
||||
from transformers import AutoModelForImageTextToText, AutoProcessor
|
||||
|
||||
self.device = device
|
||||
self.model_name = model_name
|
||||
self.process_vision_info = process_vision_info
|
||||
|
||||
logger.info(f"Loading model: {model_name}...")
|
||||
|
||||
self.model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_name, torch_dtype=torch_dtype, device_map=device, trust_remote_code=True
|
||||
)
|
||||
self.processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
|
||||
self.processor.tokenizer.padding_side = "left"
|
||||
|
||||
logger.info(f"Model loaded on {device}")
|
||||
|
||||
def _build_messages(self, video_path: Path, episode_duration: float, prompt: str) -> list[dict]:
|
||||
duration_str = f"{int(episode_duration // 60):02d}:{int(episode_duration % 60):02d}"
|
||||
return [
|
||||
{"role": "system", "content": [{"type": "text", "text": prompt}]},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "video", "video": str(video_path), "fps": 1.0},
|
||||
{
|
||||
"type": "text",
|
||||
"text": (
|
||||
f"Video duration: {duration_str} (exactly {episode_duration:.1f} seconds). "
|
||||
f"Segment into atomic skills. Last skill must end at {episode_duration:.1f}."
|
||||
),
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
def _prepare_inputs(self, messages: list[dict]) -> dict:
|
||||
"""Tokenize a single message and return processor inputs on device."""
|
||||
text = self.processor.apply_chat_template(
|
||||
messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
|
||||
)
|
||||
image_inputs, video_inputs = self.process_vision_info(messages, return_video_metadata=True)
|
||||
|
||||
videos, video_metadata = None, None
|
||||
if video_inputs:
|
||||
videos = [v[0] for v in video_inputs]
|
||||
video_metadata = [v[1] for v in video_inputs]
|
||||
|
||||
return self.processor(
|
||||
text=[text],
|
||||
images=image_inputs,
|
||||
videos=videos,
|
||||
videos_kwargs={
|
||||
"video_metadata": video_metadata,
|
||||
"do_sample_frames": False,
|
||||
},
|
||||
padding=True,
|
||||
return_tensors="pt",
|
||||
).to(self.device)
|
||||
|
||||
def _decode(self, inputs, generated_ids) -> list[str]:
|
||||
return self.processor.batch_decode(
|
||||
[out[len(inp) :] for inp, out in zip(inputs.input_ids, generated_ids, strict=True)],
|
||||
skip_special_tokens=True,
|
||||
clean_up_tokenization_spaces=False,
|
||||
)
|
||||
|
||||
def segment_skills(
|
||||
self,
|
||||
video_path: Path,
|
||||
episode_duration: float,
|
||||
coarse_goal: str | None = None,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> list[Skill]:
|
||||
prompt = create_skill_segmentation_prompt(
|
||||
coarse_goal, subtask_labels, duration_seconds=episode_duration
|
||||
)
|
||||
messages = self._build_messages(video_path, episode_duration, prompt)
|
||||
inputs = self._prepare_inputs(messages)
|
||||
|
||||
with torch.no_grad():
|
||||
generated_ids = self.model.generate(
|
||||
**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7
|
||||
)
|
||||
|
||||
response = self._decode(inputs, generated_ids)[0].strip()
|
||||
return self._parse_skills_response(response)
|
||||
|
||||
def segment_skills_batch(
|
||||
self,
|
||||
video_paths: list[Path],
|
||||
episode_durations: list[float],
|
||||
coarse_goal: str | None = None,
|
||||
subtask_labels: list[str] | None = None,
|
||||
) -> list[list[Skill]]:
|
||||
all_texts = []
|
||||
all_video_tuples: list[tuple] = []
|
||||
|
||||
for video_path, duration in zip(video_paths, episode_durations, strict=True):
|
||||
prompt = create_skill_segmentation_prompt(coarse_goal, subtask_labels, duration_seconds=duration)
|
||||
messages = self._build_messages(video_path, duration, prompt)
|
||||
|
||||
text = self.processor.apply_chat_template(
|
||||
messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
|
||||
)
|
||||
_image_inputs, video_inputs = self.process_vision_info(messages, return_video_metadata=True)
|
||||
all_texts.append(text)
|
||||
all_video_tuples.extend(video_inputs or [])
|
||||
|
||||
videos, video_metadata = None, None
|
||||
if all_video_tuples:
|
||||
videos = [v[0] for v in all_video_tuples]
|
||||
video_metadata = [v[1] for v in all_video_tuples]
|
||||
|
||||
inputs = self.processor(
|
||||
text=all_texts,
|
||||
videos=videos,
|
||||
videos_kwargs={
|
||||
"video_metadata": video_metadata,
|
||||
"do_sample_frames": False,
|
||||
},
|
||||
padding=True,
|
||||
return_tensors="pt",
|
||||
).to(self.device)
|
||||
|
||||
with torch.no_grad():
|
||||
generated_ids = self.model.generate(
|
||||
**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7
|
||||
)
|
||||
|
||||
responses = self._decode(inputs, generated_ids)
|
||||
|
||||
all_skills = []
|
||||
for idx, response in enumerate(responses):
|
||||
try:
|
||||
skills = self._parse_skills_response(response.strip())
|
||||
if not skills:
|
||||
logger.warning(f"No skills parsed for video {idx}")
|
||||
all_skills.append(skills)
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to parse response for video {idx}: {e}")
|
||||
all_skills.append([])
|
||||
|
||||
return all_skills
|
||||
|
||||
|
||||
def get_vlm(model_name: str, device: str = "cuda", torch_dtype: torch.dtype = torch.bfloat16) -> BaseVLM:
|
||||
"""Create a VLM instance. Defaults to QwenVL which supports the Qwen3.5 series."""
|
||||
return QwenVL(model_name, device, torch_dtype)
|
||||
@@ -1,4 +1,5 @@
|
||||
#!/usr/bin/env python
|
||||
from __future__ import annotations
|
||||
|
||||
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
@@ -21,7 +22,11 @@ from collections import deque
|
||||
from collections.abc import Iterable, Iterator
|
||||
from pathlib import Path
|
||||
from pprint import pformat
|
||||
from typing import Any
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from lerobot.data_processing.data_annotations.subtask_annotations import EpisodeSkills
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
import datasets
|
||||
import numpy as np
|
||||
@@ -1216,6 +1221,111 @@ def find_float_index(target, float_list, threshold=1e-6):
|
||||
return -1
|
||||
|
||||
|
||||
def create_subtasks_dataframe(
|
||||
annotations: dict[int, EpisodeSkills],
|
||||
) -> tuple[pd.DataFrame, dict[str, int]]:
|
||||
"""
|
||||
Create a subtasks DataFrame from skill annotations.
|
||||
|
||||
Args:
|
||||
annotations: Dictionary of episode skills
|
||||
|
||||
Returns:
|
||||
Tuple of (subtasks_df, skill_to_subtask_idx mapping)
|
||||
"""
|
||||
# Collect all unique skill names
|
||||
all_skill_names: set[str] = set()
|
||||
for episode_skills in annotations.values():
|
||||
for skill in episode_skills.skills:
|
||||
all_skill_names.add(skill.name)
|
||||
# Build subtasks DataFrame
|
||||
subtask_data = []
|
||||
for i, skill_name in enumerate(sorted(all_skill_names)):
|
||||
subtask_data.append(
|
||||
{
|
||||
"subtask": skill_name,
|
||||
"subtask_index": i,
|
||||
}
|
||||
)
|
||||
|
||||
if not subtask_data:
|
||||
subtasks_df = pd.DataFrame(columns=["subtask", "subtask_index"]).set_index("subtask")
|
||||
else:
|
||||
subtasks_df = pd.DataFrame(subtask_data).set_index("subtask")
|
||||
|
||||
# Build skill name to subtask_index mapping
|
||||
skill_to_subtask_idx = {
|
||||
skill_name: int(subtasks_df.loc[skill_name, "subtask_index"]) for skill_name in all_skill_names
|
||||
}
|
||||
|
||||
return subtasks_df, skill_to_subtask_idx
|
||||
|
||||
|
||||
def save_subtasks(
|
||||
subtasks_df: pd.DataFrame,
|
||||
dataset_root: Path,
|
||||
) -> None:
|
||||
"""Save subtasks to subtasks.parquet."""
|
||||
output_path = dataset_root / "meta" / "subtasks.parquet"
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
subtasks_df.to_parquet(output_path, engine="pyarrow", compression="snappy")
|
||||
|
||||
|
||||
def create_subtask_index_array(
|
||||
dataset: LeRobotDataset,
|
||||
annotations: dict[int, EpisodeSkills],
|
||||
skill_to_subtask_idx: dict[str, int],
|
||||
) -> np.ndarray:
|
||||
"""
|
||||
Create a subtask_index array for each frame based on skill annotations.
|
||||
|
||||
Args:
|
||||
dataset: The LeRobot dataset
|
||||
annotations: Dictionary of episode skills
|
||||
skill_to_subtask_idx: Mapping from skill name to subtask_index
|
||||
|
||||
Returns:
|
||||
Array of subtask indices for each frame in the dataset
|
||||
"""
|
||||
# Array to store subtask index for each frame
|
||||
# Initialize with -1 to indicate unannotated frames
|
||||
full_dataset_length = len(dataset)
|
||||
subtask_indices = np.full(full_dataset_length, -1, dtype=np.int64)
|
||||
|
||||
# Assign subtask_index for each annotated episode
|
||||
fps = float(dataset.meta.fps)
|
||||
for ep_idx, episode_skills in annotations.items():
|
||||
skills = episode_skills.skills
|
||||
|
||||
# Get episode frame range
|
||||
ep = dataset.meta.episodes[ep_idx]
|
||||
ep_from = int(ep["dataset_from_index"])
|
||||
ep_to = int(ep["dataset_to_index"])
|
||||
|
||||
# Process each frame in the episode (compute timestamp from index to avoid loading video)
|
||||
for frame_idx in range(ep_from, ep_to):
|
||||
timestamp = (frame_idx - ep_from) / fps
|
||||
|
||||
# Find which skill covers this timestamp (inline to avoid circular import)
|
||||
skill = None
|
||||
for s in skills:
|
||||
if s.start <= timestamp < s.end:
|
||||
skill = s
|
||||
break
|
||||
if timestamp >= s.end and s == skills[-1]:
|
||||
skill = s
|
||||
break
|
||||
if not skill and skills:
|
||||
skill = skills[-1]
|
||||
|
||||
if skill and skill.name in skill_to_subtask_idx:
|
||||
subtask_idx = skill_to_subtask_idx[skill.name]
|
||||
subtask_indices[frame_idx] = subtask_idx
|
||||
|
||||
return subtask_indices
|
||||
|
||||
|
||||
class LookBackError(Exception):
|
||||
"""
|
||||
Exception raised when trying to look back in the history of a Backtrackable object.
|
||||
@@ -1279,7 +1389,7 @@ class Backtrackable[T]:
|
||||
self._history = history
|
||||
self._lookahead = lookahead
|
||||
|
||||
def __iter__(self) -> "Backtrackable[T]":
|
||||
def __iter__(self) -> Backtrackable[T]:
|
||||
return self
|
||||
|
||||
def __next__(self) -> T:
|
||||
|
||||
@@ -0,0 +1,160 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""
|
||||
Automatic Skill Annotation for LeRobot Datasets.
|
||||
|
||||
This script performs automatic subtask/skill labeling for ANY LeRobot dataset using
|
||||
Vision-Language Models (VLMs). It segments each robot demonstration into short atomic
|
||||
skills (1-3 seconds each) and creates a new dataset with subtask annotations.
|
||||
|
||||
The pipeline:
|
||||
1. Loads a LeRobot dataset (local or from HuggingFace Hub)
|
||||
2. For each episode, extracts video frames
|
||||
3. Uses a VLM to identify skill boundaries and labels
|
||||
4. Creates a subtasks.parquet file with unique subtasks
|
||||
5. Adds a subtask_index feature to the dataset
|
||||
|
||||
Supported VLMs (modular design): Qwen2-VL, Qwen3-VL, Qwen3.5-VL (see vlm_annotations.py).
|
||||
|
||||
Usage:
|
||||
lerobot-dataset-subtask-annotate --repo_id=user/dataset --video_key=observation.images.base ...
|
||||
lerobot-dataset-subtask-annotate --data_dir=/path/to/dataset --video_key=observation.images.base ...
|
||||
"""
|
from dataclasses import dataclass
from pathlib import Path

import torch

from lerobot.configs import parser
from lerobot.data_processing.data_annotations.subtask_annotations import (
    SkillAnnotator,
    save_skill_annotations,
)
from lerobot.data_processing.data_annotations.vlm_annotations import get_vlm
from lerobot.datasets.lerobot_dataset import LeRobotDataset


@dataclass
class SubtaskAnnotateConfig:
    """Configuration for automatic subtask/skill annotation with VLMs."""

    # Data source: provide exactly one of data_dir (local) or repo_id (Hub)
    data_dir: str | None = None
    repo_id: str | None = None
    # Video observation key (e.g. observation.images.base)
    video_key: str = "observation.images.base"
    # VLM model name (default: Qwen/Qwen2-VL-7B-Instruct)
    model: str = "Qwen/Qwen2-VL-7B-Instruct"
    device: str = "cuda"
    dtype: str = "bfloat16"
    batch_size: int = 8
    # Episode selection (default: all)
    episodes: list[int] | None = None
    skip_existing: bool = False
    # Output
    output_dir: str | None = None
    output_repo_id: str | None = None
    push_to_hub: bool = False
    # Closed vocabulary: comma-separated labels (e.g. "label1,label2,label3")
    subtask_labels: str | None = None
    # Disable timer overlay on video (by default a timer is drawn for the VLM)
    no_timer_overlay: bool = False


@parser.wrap()
def subtask_annotate(cfg: SubtaskAnnotateConfig):
    """
    Run automatic skill annotation on a LeRobot dataset using a VLM.

    Args:
        cfg: SubtaskAnnotateConfig with data source, model, and output options.
    """
    if (cfg.data_dir is None) == (cfg.repo_id is None):
        raise ValueError("Provide exactly one of --data_dir or --repo_id")

    # Parse comma-separated subtask labels into a list (or None)
    subtask_labels_list: list[str] | None = None
    if cfg.subtask_labels and cfg.subtask_labels.strip():
        subtask_labels_list = [s.strip() for s in cfg.subtask_labels.split(",") if s.strip()]

    dtype_map = {
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
        "float32": torch.float32,
    }
    torch_dtype = dtype_map[cfg.dtype]

    print("Loading dataset...")
    if cfg.data_dir:
        dataset = LeRobotDataset(repo_id="local/dataset", root=cfg.data_dir, download_videos=False)
    else:
        dataset = LeRobotDataset(repo_id=cfg.repo_id, download_videos=True)

    print(f" Loaded dataset with {dataset.meta.total_episodes} episodes")

    if cfg.video_key not in dataset.meta.video_keys:
        available = ", ".join(dataset.meta.video_keys)
        raise ValueError(f"Video key '{cfg.video_key}' not found. Available: {available}")

    print(f"Initializing VLM: {cfg.model}...")
    vlm = get_vlm(cfg.model, cfg.device, torch_dtype)

    add_timer_overlay = not cfg.no_timer_overlay
    annotator = SkillAnnotator(
        vlm=vlm,
        batch_size=cfg.batch_size,
        add_timer_overlay=add_timer_overlay,
    )
    print(f"Processing with batch size: {cfg.batch_size}")
    annotations = annotator.annotate_dataset(
        dataset=dataset,
        video_key=cfg.video_key,
        episodes=cfg.episodes,
        skip_existing=cfg.skip_existing,
        subtask_labels=subtask_labels_list,
    )

    output_dir = Path(cfg.output_dir) if cfg.output_dir else None
    output_repo_id = cfg.output_repo_id
    new_dataset = save_skill_annotations(dataset, annotations, output_dir, output_repo_id)

    total_skills = sum(len(ann.skills) for ann in annotations.values())
    print("\nAnnotation complete!")
    print(f"Episodes annotated: {len(annotations)}")
    print(f"Total subtasks identified: {total_skills}")
    print(f"Dataset with subtask_index saved to: {new_dataset.root}")

    if cfg.push_to_hub:
        if cfg.data_dir:
            print("Warning: --push_to_hub requires --repo_id, skipping...")
        else:
            print("Pushing to HuggingFace Hub...")
            try:
                new_dataset.push_to_hub(branch="subtasks")
                print(f" Pushed to {output_repo_id or cfg.repo_id}")
            except Exception as e:
                print(f"Push failed: {e}")


def main():
    """CLI entry point that parses config and runs subtask annotation."""
    subtask_annotate()


if __name__ == "__main__":
    main()
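For orientation, here is a hedged sketch (not part of the diff) of the configuration the script above parses; the repo id and label values are placeholders, not values from this commit.

```python
# Illustrative only: a SubtaskAnnotateConfig as the CLI parser above would build it.
cfg = SubtaskAnnotateConfig(
    repo_id="your-name/your-dataset",    # placeholder; or data_dir=... for a local dataset (exactly one of the two)
    video_key="observation.images.base",  # must be one of dataset.meta.video_keys
    subtask_labels="pick,place",          # closed vocabulary, comma-separated
    dtype="bfloat16",
    batch_size=8,
)
```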
@@ -89,3 +89,92 @@ LIBERO_KEY_JOINTS_POS = "robot_state/joints/pos"
LIBERO_KEY_JOINTS_VEL = "robot_state/joints/vel"
LIBERO_KEY_PIXELS_AGENTVIEW = "pixels/agentview_image"
LIBERO_KEY_PIXELS_EYE_IN_HAND = "pixels/robot0_eye_in_hand_image"


def format_subtask_labels_section(subtask_labels: list[str]) -> str:
    """Format a list of subtask labels for the closed-vocabulary section of the prompt."""
    return "\n ".join(f'"{label}"' for label in subtask_labels)


SKILL_SEGMENTATION_PROMPT_TEMPLATE = """# Role
You are a Robotics Vision System specializing in temporal action segmentation for robot manipulation demonstrations.

# Video duration (critical)
The total video length is **{video_duration_seconds} seconds** ({video_duration_mm_ss}). All "start" and "end" values in your JSON must be numeric seconds in the range [0.0, {video_duration_seconds}]. The last skill's "end" must be exactly **{video_duration_seconds}**. Do not stop earlier.

# Task
{goal_context}Segment this robot demonstration video into short atomic manipulation skills. Each skill should:
- Last approximately 1-3 seconds (or longer if the action takes longer)
- Describe a clear, single action (e.g., "pick up object", "move arm left", "release gripper")
- Have precise start and end timestamps in seconds (float)

# Requirements
1. **Atomic Actions**: Each skill should be a single, indivisible action
2. **Complete Coverage**: Skills must cover the entire video from 0.0 to {video_duration_seconds} seconds with no gaps
3. **Boundary Consistency**: The end of one skill equals the start of the next
4. **Natural Language**: Use clear, descriptive names for each skill
5. **Timestamps**: Use seconds as floats (e.g. 12.5) for all timestamps; the last "end" must be {video_duration_seconds}. If the video has a visible timer in the corner showing elapsed time in seconds, use it to report accurate start and end times for each skill.
# Subtask Label Set (Closed Vocabulary)
You MUST strictly identify the video segments using ONLY the following labels. Do not create new labels or modify existing ones:

[
    {subtask_labels_section}
]

The video shows one successful execution of all subtasks in a logical order.

# Ground-Truth Semantics (Very Important)
Use **visual state changes** to define when a subtask starts and ends. Do NOT assume equal durations for the subtasks.

- A subtask **starts** at the first frame where the robot's motion clearly initiates that subtask.
- A subtask **ends** at the first frame where that specific action is visually completed and the manipulated object reaches a temporary, stable configuration.

If there are short pauses or micro-motions that don't clearly correspond to a new subtask, they belong to the **current** subtask.

# Hard Constraints & Logic
1. **Continuous Coverage (No Gaps):**
   - The entire video from 0.0 to {video_duration_seconds} seconds must be covered by subtasks.
   - There can be no gaps between subtasks.
   - If there is any idle or ambiguous time between clear actions, extend the *preceding* subtask to cover it.

2. **Boundary Consistency:**
   - The `"end"` timestamp of one subtask must be exactly equal to the `"start"` timestamp of the next subtask.
   - Boundaries must coincide with a real visual state transition, not just a convenient time split.

3. **Chronological Order, One Occurrence Each:**
   - This is a single successful demonstration.
   - Each subtask from the vocabulary appears **exactly once**, in the correct logical order.
   - **Durations may be very different** between subtasks. Never assume they are similar lengths. Base all boundaries only on the video.

4. **Reject Uniform Segmentation (Important):**
   - Do NOT simply divide the video into equal or nearly equal time chunks.
   - If your boundaries would result in subtasks with similar durations (e.g. all around 5 seconds), treat this as evidence that your segmentation is wrong and refine the boundaries.
   - Only use nearly equal durations if the video truly shows each subtask taking the same amount of time (this is very rare).

5. **Timestamps (critical):**
   - Use numeric seconds (float) in the JSON, e.g. 0.0, 5.2, 12.8.
   - The first subtask always starts at 0.0.
   - The last subtask must end at exactly {video_duration_seconds} (the full video length).
   - **Time is displayed inside the video**: a visible timer in one corner shows the elapsed time in seconds (from 0.0 to the end). Use this on-screen timer to set accurate start and end times for each skill.
Format this as a bullet list.

# Output Format
output ONLY valid JSON with this exact structure. The last skill's "end" MUST be exactly {video_duration_seconds}. Use the timestamps you read from the visible timer in the video:

```json
{{
  "skills": [
    {{"name": "first skill", "start": 0.0, "end": 5.0}},
    {{"name": "second skill", "start": 5.0, "end": 12.0}},
    {{"name": "last skill", "start": 12.0, "end": {video_duration_seconds}}}
  ]
}}
```

The first skill must start at 0.0 and the last skill must end at **{video_duration_seconds}** (the total video duration in seconds).
# Strict Structural Rule
This video contains exactly ALL subtasks given to you.
Each segment must use a unique label from the vocabulary.
No label may be repeated.

"""
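As a reading aid (not part of the diff), here is a minimal sketch of how the template's placeholders would be filled; the duration and label values below are made up.

```python
# Hypothetical values; the doubled braces in the template keep the JSON example intact under str.format().
labels = ["pick", "place"]
prompt = SKILL_SEGMENTATION_PROMPT_TEMPLATE.format(
    video_duration_seconds=12.5,
    video_duration_mm_ss="00:12",
    goal_context="",  # optionally a sentence describing the task goal
    subtask_labels_section=format_subtask_labels_section(labels),
)
```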
@@ -74,7 +74,7 @@ _peft_available = is_package_available("peft")
_scipy_available = is_package_available("scipy")
_reachy2_sdk_available = is_package_available("reachy2_sdk")
_can_available = is_package_available("python-can", "can")
_unitree_sdk_available = is_package_available("unitree-sdk2", "unitree_sdk2py")
_unitree_sdk_available = is_package_available("unitree-sdk2py", "unitree_sdk2py")
_pygame_available = is_package_available("pygame")
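A small sketch (not from the diff) of how such an availability flag is typically used to guard the optional dependency; the error-message wording is an assumption, while the pip package name matches the corrected line above.

```python
# Assumed usage pattern: fail fast with an actionable message when the SDK is missing.
if not _unitree_sdk_available:
    raise ImportError(
        "unitree_sdk2py is not installed. Install the `unitree-sdk2py` pip package "
        "or build it from source as described in the Unitree SDK installation guide."
    )
```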
@@ -23,11 +23,18 @@ These tests verify that:
- Subtask handling gracefully handles missing data
"""

import numpy as np
import pandas as pd
import pytest
import torch

from lerobot.data_processing.data_annotations.subtask_annotations import EpisodeSkills, Skill
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import (
    create_subtask_index_array,
    create_subtasks_dataframe,
    save_subtasks,
)


class TestSubtaskDataset:
@@ -188,3 +195,164 @@ class TestSubtaskEdgeCases:
                )
            else:
                subtask_map[idx] = subtask


class TestCreateSubtasksDataframe:
    """Tests for create_subtasks_dataframe in utils."""

    def test_empty_annotations(self):
        """Empty annotations produce empty DataFrame and empty mapping."""
        subtasks_df, skill_to_subtask_idx = create_subtasks_dataframe({})
        assert len(subtasks_df) == 0
        assert list(subtasks_df.columns) == ["subtask_index"]
        assert skill_to_subtask_idx == {}

    def test_single_episode_single_skill(self):
        """Single episode with one skill produces one row and correct mapping."""
        annotations = {
            0: EpisodeSkills(
                episode_index=0,
                description="Pick",
                skills=[Skill("pick", 0.0, 1.0)],
            ),
        }
        subtasks_df, skill_to_subtask_idx = create_subtasks_dataframe(annotations)
        assert len(subtasks_df) == 1
        assert subtasks_df.index.tolist() == ["pick"]
        assert subtasks_df.loc["pick", "subtask_index"] == 0
        assert skill_to_subtask_idx == {"pick": 0}

    def test_multiple_episodes_overlapping_skills(self):
        """Multiple episodes with overlapping skill names yield unique sorted skills."""
        annotations = {
            0: EpisodeSkills(
                episode_index=0,
                description="Ep0",
                skills=[
                    Skill("place", 0.0, 0.5),
                    Skill("pick", 0.5, 1.0),
                ],
            ),
            1: EpisodeSkills(
                episode_index=1,
                description="Ep1",
                skills=[Skill("pick", 0.0, 1.0)],
            ),
        }
        subtasks_df, skill_to_subtask_idx = create_subtasks_dataframe(annotations)
        # Sorted order: pick, place
        assert subtasks_df.index.tolist() == ["pick", "place"]
        assert int(subtasks_df.loc["pick", "subtask_index"]) == 0
        assert int(subtasks_df.loc["place", "subtask_index"]) == 1
        assert skill_to_subtask_idx["pick"] == 0
        assert skill_to_subtask_idx["place"] == 1

    def test_skills_sorted_alphabetically(self):
        """Subtask rows are in alphabetical order by skill name."""
        annotations = {
            0: EpisodeSkills(
                episode_index=0,
                description="Ep",
                skills=[
                    Skill("z_final", 0.0, 0.33),
                    Skill("a_first", 0.33, 0.66),
                    Skill("m_mid", 0.66, 1.0),
                ],
            ),
        }
        subtasks_df, _ = create_subtasks_dataframe(annotations)
        assert subtasks_df.index.tolist() == ["a_first", "m_mid", "z_final"]
        assert list(subtasks_df["subtask_index"]) == [0, 1, 2]


class TestSaveSubtasks:
    """Tests for save_subtasks in utils."""

    def test_save_subtasks_creates_file(self, tmp_path):
        """save_subtasks writes meta/subtasks.parquet and creates parent dir."""
        subtasks_df = pd.DataFrame(
            [{"subtask": "pick", "subtask_index": 0}, {"subtask": "place", "subtask_index": 1}]
        ).set_index("subtask")
        save_subtasks(subtasks_df, tmp_path)
        out = tmp_path / "meta" / "subtasks.parquet"
        assert out.exists()
        read_df = pd.read_parquet(out)
        pd.testing.assert_frame_equal(read_df.reset_index(), subtasks_df.reset_index())

    def test_save_subtasks_content_matches(self, tmp_path):
        """Saved parquet round-trips with same content."""
        subtasks_df = pd.DataFrame(
            [{"subtask": "a", "subtask_index": 0}, {"subtask": "b", "subtask_index": 1}]
        ).set_index("subtask")
        save_subtasks(subtasks_df, tmp_path)
        read_df = pd.read_parquet(tmp_path / "meta" / "subtasks.parquet")
        assert read_df.index.tolist() == subtasks_df.index.tolist()
        assert list(read_df["subtask_index"]) == list(subtasks_df["subtask_index"])


class TestCreateSubtaskIndexArray:
    """Tests for create_subtask_index_array in utils."""

    @pytest.fixture
    def dataset_with_episodes(self, tmp_path, empty_lerobot_dataset_factory):
        """Dataset with two episodes (10 frames each) for index-array tests."""
        features = {"state": {"dtype": "float32", "shape": (2,), "names": None}}
        dataset = empty_lerobot_dataset_factory(root=tmp_path / "subtask_idx", features=features)
        for _ in range(10):
            dataset.add_frame({"state": torch.randn(2), "task": "Task A"})
        dataset.save_episode()
        for _ in range(10):
            dataset.add_frame({"state": torch.randn(2), "task": "Task B"})
        dataset.save_episode()
        dataset.finalize()
        return LeRobotDataset(dataset.repo_id, root=dataset.root)

    def test_unannotated_all_minus_one(self, dataset_with_episodes):
        """With no annotations, all frame indices are -1."""
        skill_to_subtask_idx = {"pick": 0, "place": 1}
        arr = create_subtask_index_array(dataset_with_episodes, {}, skill_to_subtask_idx)
        assert len(arr) == len(dataset_with_episodes)
        assert arr.dtype == np.int64
        assert np.all(arr == -1)

    def test_annotated_episode_assigns_by_timestamp(self, dataset_with_episodes):
        """Frames in an annotated episode get subtask index from skill time ranges."""
        # Dataset uses DEFAULT_FPS=30. Episode 0: 10 frames -> timestamps 0, 1/30, ..., 9/30 (~0.3s).
        # Skills: "pick" [0, 0.2), "place" [0.2, 0.5). At 30 fps: 0.2s = 6 frames, so frames 0-5 = pick, 6-9 = place.
        annotations = {
            0: EpisodeSkills(
                episode_index=0,
                description="Pick and place",
                skills=[
                    Skill("pick", 0.0, 0.2),  # frames 0-5 at 30 fps
                    Skill("place", 0.2, 0.5),  # frames 6-9 at 30 fps
                ],
            ),
        }
        skill_to_subtask_idx = {"pick": 0, "place": 1}
        arr = create_subtask_index_array(dataset_with_episodes, annotations, skill_to_subtask_idx)
        assert len(arr) == 20
        # Episode 0: from_index=0, to_index=10 at 30 fps
        for i in range(6):
            assert arr[i] == 0, f"frame {i} should be pick"
        for i in range(6, 10):
            assert arr[i] == 1, f"frame {i} should be place"
        # Episode 1 not annotated
        for i in range(10, 20):
            assert arr[i] == -1

    def test_partial_annotations_leave_others_minus_one(self, dataset_with_episodes):
        """Only annotated episodes get non -1 indices; others stay -1."""
        annotations = {
            1: EpisodeSkills(
                episode_index=1,
                description="Place only",
                skills=[Skill("place", 0.0, 1.0)],
            ),
        }
        skill_to_subtask_idx = {"place": 0}
        arr = create_subtask_index_array(dataset_with_episodes, annotations, skill_to_subtask_idx)
        for i in range(10):
            assert arr[i] == -1
        for i in range(10, 20):
            assert arr[i] == 0
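Finally, a hedged sketch (not part of the commit) of how a downstream consumer might read back the metadata these tests exercise; the dataset path is a placeholder.

```python
import pandas as pd

# meta/subtasks.parquet layout follows the tests above: index = subtask name, column = subtask_index.
subtasks = pd.read_parquet("path/to/dataset/meta/subtasks.parquet")  # placeholder path
idx_to_name = {int(row["subtask_index"]): name for name, row in subtasks.iterrows()}
# Per create_subtask_index_array, frames in unannotated episodes carry subtask_index == -1.
```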