Compare commits
26 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 995a46b302 | | |
| | 23d4846423 | | |
| | 7d897daeb2 | | |
| | 7556c7fd70 | | |
| | 4434c863b4 | | |
| | 4b40153c32 | | |
| | f0923e5c86 | | |
| | 8edd544bbe | | |
| | e682ef05f9 | | |
| | 9b5ac4387c | | |
| | 5781754c30 | | |
| | 18ddc67714 | | |
| | b229e7df28 | | |
| | 8e05dc9a7a | | |
| | fddd044306 | | |
| | 522396a15a | | |
| | 7e232fb114 | | |
| | dc452f37e0 | | |
| | 3c11946755 | | |
| | 8edbd5b55e | | |
| | 025c2b2831 | | |
| | c8eee4ea16 | | |
| | 9091b68d86 | | |
| | 3568df8a35 | | |
| | a811945336 | | |
| | 0a10d377b5 | | |
@@ -12,83 +12,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.

name: "🚀 Issue / Bug / Request"
description: Report a bug, suggest an improvement, or ask a technical question.
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve LeRobot
body:
- type: markdown
attributes:
value: |
### Thanks for contributing to LeRobot! 🙌
Please choose the most relevant sections below. If this is a general "how-to" question, consider our [Discord](https://discord.gg/s3KuuzsPFb) for faster community support.

- type: dropdown
id: issue-type
attributes:
label: Ticket Type
description: What kind of ticket are you opening?
options:
- "🐛 Bug Report (Something isn't working)"
- "💡 Feature Request / Improvement"
- "❓ Technical Question"
- "🧹 Maintenance / Documentation"
validations:
required: true
Thanks for taking the time to submit a bug report! 🐛
If this is not a bug related to the LeRobot library directly, but instead a general question about your code or the library specifically please use our [discord](https://discord.gg/s3KuuzsPFb).

- type: textarea
id: system-info
attributes:
label: Environment & System Info
description: |
For bugs or technical questions, please run `lerobot-info` and paste the output.
(Optional for feature requests).
label: System Info
description: Please share your LeRobot configuration by running `lerobot-info` (if installed) or `python -m lerobot.scripts.display_sys_info` (if not installed) and pasting the output below.
render: Shell
placeholder: lerobot version, OS, python version, etc.
placeholder: lerobot version, OS, python version, numpy version, torch version, and lerobot's configuration
validations:
required: true

- type: checkboxes
id: information-scripts-examples
attributes:
label: Information
description: 'The problem arises when using:'
options:
- label: "One of the scripts in the examples/ folder of LeRobot"
- label: "My own task or dataset (give details below)"

- type: textarea
id: description
id: reproduction
validations:
required: true
attributes:
label: Description
label: Reproduction
description: |
Provide a clear summary of the issue or your proposal.
- **Bugs:** What is happening?
- **Features:** What is the goal/use case?
- **Questions:** What are you trying to achieve?
If needed, provide a simple code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
Sharing error messages or stack traces could be useful as well!
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Try to avoid screenshots, as they are hard to read and don't allow copy-and-pasting.

placeholder: |
A clear and concise description of the issue or suggestion.
Steps to reproduce the behavior:

1.
2.
3.

- type: textarea
id: context-repro
id: expected-behavior
validations:
required: true
attributes:
label: Context & Reproduction
description: |
Provide a code snippet, steps to reproduce a bug, or technical details about your proposal.
Please use code blocks for scripts and CLI commands.
placeholder: |
Steps to reproduce / Usage example:
1.
2.
3.

- type: textarea
id: logs
attributes:
label: Relevant logs or stack trace
description: If applicable, paste relevant error logs here.
render: Shell

- type: checkboxes
id: extras
attributes:
label: Checklist
options:
- label: I have searched existing tickets to ensure this isn't a duplicate.
- label: I am using the latest version of the `main` branch.
- label: I have verified this is not an environment-specific problem.

- type: textarea
id: workaround
attributes:
label: Additional Info / Workarounds
description: Anything else we should know? If you have a workaround, please share it!
label: Expected behavior
description: "A clear and concise description of what you would expect to happen."

@@ -1,55 +1,41 @@
## Title
## What this does

Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). See [CONTRIBUTING.md](../CONTRIBUTING.md) for PR conventions.
Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

## Type / Scope
Examples:
| Title | Label |
|----------------------|-----------------|
| Fixes #[issue] | (🐛 Bug) |
| Adds new dataset | (🗃️ Dataset) |
| Optimizes something | (⚡️ Performance) |

- **Type**: (Bug | Feature | Docs | Performance | Test | CI | Chore)
- **Scope**: (optional — name of module or package affected)
## How it was tested

## Summary / Motivation
Explain/show how you tested your changes.

- One-paragraph description of what changes and why.
- Why this change is needed and any trade-offs or design notes.
Examples:

## Related issues
- Added `test_something` in `tests/test_stuff.py`.
- Added `new_feature` and checked that training converges with policy X on dataset/environment Y.
- Optimized `some_function`, it now runs X times faster than previously.

- Fixes / Closes: # (if any)
- Related: # (if any)
## How to checkout & try? (for the reviewer)

## What changed
Provide a simple way for the reviewer to try out your changes.

- Short, concrete bullets of the modifications (files/behaviour).
- Short note if this introduces breaking changes and migration steps.
Examples:

## How was this tested (or how to run locally)
```bash
pytest -sx tests/test_stuff.py::test_something
```

- Tests added: list new tests or test files.
- Manual checks / dataset runs performed.
- Instructions for the reviewer
```bash
lerobot-train --some.option=true
```

Example:
## SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

- Ran the relevant tests:
**Note**: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

```bash
pytest -q tests/ -k <keyword>
```

- Reproduce with a quick example or CLI (if applicable):

```bash
lerobot-train --some.option=true
```

## Checklist (required before merge)

- [ ] Linting/formatting run (`pre-commit run -a`)
- [ ] All tests pass locally (`pytest`)
- [ ] Documentation updated
- [ ] CI is green

## Reviewer notes

- Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
- Anyone in the community is free to review the PR.
**Note**: Before submitting this PR, please read the [contributor guideline](https://github.com/huggingface/lerobot/blob/main/CONTRIBUTING.md#submitting-a-pull-request-pr).

@@ -1,69 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

CI:
- changed-files:
- any-glob-to-any-file:
- '.github/**'
- 'docker/**'

github_actions:
- changed-files:
- any-glob-to-any-file: '.github/**'

documentation:
- changed-files:
- any-glob-to-any-file:
- '**/*.md'
- '**/*.mdx'
- 'docs/**'

examples:
- changed-files:
- any-glob-to-any-file: 'examples/**'

tests:
- changed-files:
- any-glob-to-any-file: 'tests/**'

sensors:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/cameras/**'

configuration:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/configs/**'

dataset:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/datasets/**'

evaluation:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/envs/**'

robots:
- changed-files:
- any-glob-to-any-file:
- 'src/lerobot/teleoperators/**'
- 'src/lerobot/robots/**'
- 'src/lerobot/motors/**'

policies:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/policies/**'

processor:
- changed-files:
- any-glob-to-any-file: 'src/lerobot/processor/**'
@@ -31,8 +31,7 @@ jobs:
name: Upload Preview and Comment
if: >
github.event.workflow_run.event == 'pull_request' &&
github.event.workflow_run.conclusion == 'success' &&
github.repository == 'huggingface/lerobot'
github.event.workflow_run.conclusion == 'success'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package_name: lerobot

@@ -33,9 +33,6 @@ on:
paths:
- "docs/**"

release:
types: [published]

# Ensures that only the latest commit for a PR or branch is built, canceling older runs.
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@@ -45,16 +42,14 @@ jobs:
# This job builds and deploys the official documentation.
build_main_docs:
name: Build Main Docs
if: >
(github.event_name == 'push' || github.event_name == 'workflow_dispatch' || github.event_name == 'release') &&
github.repository == 'huggingface/lerobot'
if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
permissions:
contents: read
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
with:
commit_sha: ${{ github.sha }}
package: lerobot
additional_args: --not_python_module ${{ github.event_name == 'release' && format('--version {0}', github.event.release.tag_name) || '' }}
additional_args: --not_python_module
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
@@ -63,7 +58,7 @@ jobs:
# The result of this job triggers the 'Upload PR Documentation' workflow.
build_pr_docs:
name: Build PR Docs
if: github.event_name == 'pull_request' && github.repository == 'huggingface/lerobot'
if: github.event_name == 'pull_request'
permissions:
contents: read
pull-requests: write

@@ -45,6 +45,7 @@ permissions:
env:
UV_VERSION: "0.8.0"
PYTHON_VERSION: "3.10"
DOCKER_IMAGE_NAME: huggingface/lerobot-gpu

# Ensures that only the latest commit for a PR or branch is built, canceling older runs.
concurrency:
@@ -62,7 +63,7 @@ jobs:
HF_HOME: /mnt/cache/.cache/huggingface
HF_LEROBOT_HOME: /mnt/cache/.cache/huggingface/lerobot
steps:
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
persist-credentials: false
lfs: true

@@ -61,7 +61,7 @@ jobs:
HF_HOME: /mnt/cache/.cache/huggingface
HF_LEROBOT_HOME: /mnt/cache/.cache/huggingface/lerobot
steps:
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -85,7 +85,7 @@ jobs:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install lerobot with all extras
run: uv sync --extra all # TODO(Steven): Make flash-attn optional
run: uv sync --all-extras --no-extra groot # TODO(Steven): Make flash-attn optional

- name: Run pytest (all extras)
run: uv run pytest tests -vv --maxfail=10
@@ -127,7 +127,7 @@ jobs:
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -186,18 +186,15 @@ jobs:
steps:
- name: Get Docker Hub Token and Delete Image
# zizmor: ignore[template-injection]
env:
DOCKERHUB_LEROBOT_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
DOCKERHUB_LEROBOT_PASSWORD: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
IMAGE_FULL: ${{ needs.build-and-push-docker.outputs.image_tag }}
run: |
IMAGE_NAME=$(echo "$IMAGE_FULL" | cut -d':' -f1)
IMAGE_TAG=$(echo "$IMAGE_FULL" | cut -d':' -f2-)
IMAGE_NAME=$(echo "${{ needs.build-and-push-docker.outputs.image_tag }}" | cut -d':' -f1)
IMAGE_TAG=$(echo "${{ needs.build-and-push-docker.outputs.image_tag }}" | cut -d':' -f2)

echo "Attempting to delete image: $IMAGE_NAME:$IMAGE_TAG"

TOKEN=$(curl -s -H "Content-Type: application/json" \
-X POST \
-d "{\"username\": \"$DOCKERHUB_LEROBOT_USERNAME\", \"password\": \"$DOCKERHUB_LEROBOT_PASSWORD\"}" \
-d '{"username": "${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}", "password": "${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}"}' \
https://hub.docker.com/v2/users/login/ | jq -r .token)

if [ "$TOKEN" == "null" ] || [ -z "$TOKEN" ]; then
@@ -208,7 +205,7 @@ jobs:
HTTP_RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: JWT ${TOKEN}" \
-X DELETE \
https://hub.docker.com/v2/repositories/${IMAGE_NAME}/tags/$IMAGE_TAG)
https://hub.docker.com/v2/repositories/${IMAGE_NAME}/tags/${IMAGE_TAG}/)

if [ "$HTTP_RESPONSE" -eq 204 ]; then
echo "Successfully deleted Docker image tag: $IMAGE_NAME:$IMAGE_TAG"
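
Reviewer note: the two variants of this step differ mainly in how the image name and the Docker Hub credentials reach the shell. One passes them through an `env:` block (`$IMAGE_FULL`), the other expands `${{ ... }}` expressions into the script text before it runs, which is the pattern the `zizmor: ignore[template-injection]` comment refers to. The sketch below illustrates the environment-variable approach outside of GitHub Actions. `tag_via_env` is a hypothetical helper, the `pr-1` tag is made up, and a POSIX `sh` with `cut` on the PATH is assumed; none of this is part of either workflow.

```python
import os
import subprocess

# Same parsing as the workflow step: everything after the first ':' is the tag.
SCRIPT = 'echo "$IMAGE_FULL" | cut -d":" -f2-'


def tag_via_env(image_full: str) -> str:
    """Hypothetical helper: hand the value to the shell as data, via the environment."""
    env = {**os.environ, "IMAGE_FULL": image_full}
    result = subprocess.run(
        ["sh", "-c", SCRIPT],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Shell metacharacters in the value are printed as text, not executed.
    print(tag_via_env("huggingface/lerobot-gpu:pr-1; echo injected"))
```

Because the script text never contains the value itself, quoting mistakes in the value cannot change what the shell runs. Note also that `-f2-` keeps any further `:` characters in the tag, whereas `-f2` stops at the next one.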
@@ -1,77 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This workflow automatically labels issues based on their content.
name: Issue Labeler
on:
# Trigger on new issues and edits to existing issues
issues:
types: [opened, edited]

permissions:
contents: read
issues: write

jobs:
label-issue:
name: Auto Label Issue
runs-on: ubuntu-latest
if: github.repository == 'huggingface/lerobot'
steps:
- uses: actions/github-script@v8
with:
script: |
// Setup Input Text
const body = (context.payload.issue.body || '');
const title = (context.payload.issue.title || '');
const cleanBody = body.replace(/```[\s\S]*?```/g, '');
const text = `${title}\n${cleanBody}`.toLowerCase();
const labelsToAdd = new Set();
const matches = (re) => re.test(text);

// Keyword Heuristics

if (matches(/\b(bug|error|crash|exception)\b/i)) labelsToAdd.add('bug');
if (matches(/\b(new feature|enhancement|improvement|proposal|feature request)\b/i)) labelsToAdd.add('enhancement');
if (matches(/\b(question|how to|clarify|explain|how do i|help me|question about)\b/i)) labelsToAdd.add('question');
if (matches(/\b(documentation|docs?|readme|tutorial|wiki|typo|docstring)\b/i)) labelsToAdd.add('documentation');
if (matches(/\b(example|sample|demo|notebook)s?\b/i)) labelsToAdd.add('examples');
if (matches(/\b(datasets?|data loader|data augmentation|data preprocessing)\b/i)) labelsToAdd.add('dataset');
if (matches(/\b(mujoco|isaac|simulation|sim)\b/i)) labelsToAdd.add('simulation');
if (matches(/\b(train|training|optimizer|gradient|wandb|sac)\b/i)) labelsToAdd.add('training');
if (matches(/\b(rerun|plot|render|rendering|visualizer)/i)) labelsToAdd.add('visualization');
if (matches(/\b(cameras?|opencv|realsense|lidars?|sensors?|imus?|microphones?|rgbd|encoders?)\b/i)) labelsToAdd.add('sensors');
if (matches(/\b(urdf|actuators?|calibration|end-effector|kinematics)\b/i)) labelsToAdd.add('robots');
if (matches(/\b(teleop|teleoperator|controller|leader|follower|joystick|gamepad)\b/i)) labelsToAdd.add('teleoperators');
if (matches(/\b(policy|policies|model?)\b/i)) labelsToAdd.add('policies');
if (matches(/\b(processor|pipeline|preprocessor|postprocessor)s?\b/i)) labelsToAdd.add('processor');
if (matches(/\b(eval|evaluate|evaluation|metrics?|score|benchmarks?)\b/i)) labelsToAdd.add('evaluation');
if (matches(/\b(tests?|pytest|unittest|failing test)\b/i)) labelsToAdd.add('tests');
if (matches(/\b(ci|github actions?|github workflows?|gha|docker|pypi)\b/i)) labelsToAdd.add('CI');
if (matches(/\b(perf|latency|throughput|fps|speed|performance|slow|fast|slower|faster|memory usage)\b/i)) labelsToAdd.add('performance');
if (matches(/\b(dependency|dependencies|pip|install error|importerror|package not found|pyproject)\b/i)) labelsToAdd.add('dependencies');
if (matches(/\b(configuration|config|arguments?|input feature|dracuss)\b/i)) labelsToAdd.add('configuration');

// Apply Labels
const labels = Array.from(labelsToAdd).filter(Boolean);

if (labels.length > 0) {
console.log(`Adding labels: ${labels.join(', ')}`);
await github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
labels,
});
}
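
Reviewer note: the deleted script above strips fenced code blocks from the issue body, lowercases the title plus the remaining text, and adds one label for every keyword pattern that matches. A minimal Python sketch of the same heuristic follows, handy for trying patterns locally. `suggest_labels` and the three-entry `RULES` table are illustrative names (the patterns are copied from the script, the names are not), and it assumes Python's `re` treats these ASCII patterns the same way the JavaScript regexes do.

```python
import re

# Three backticks, built at runtime so this listing stays easy to embed in markdown.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"[\s\S]*?" + TICKS)

# Illustrative subset of the keyword table from the workflow script.
RULES = {
    "bug": re.compile(r"\b(bug|error|crash|exception)\b"),
    "documentation": re.compile(r"\b(documentation|docs?|readme|tutorial|wiki|typo|docstring)\b"),
    "tests": re.compile(r"\b(tests?|pytest|unittest|failing test)\b"),
}


def suggest_labels(title: str, body: str) -> set:
    """Return the labels whose pattern matches the title or the body outside code fences."""
    clean_body = FENCE.sub("", body or "")
    text = f"{title or ''}\n{clean_body}".lower()
    return {label for label, pattern in RULES.items() if pattern.search(text)}
```

Removing fenced blocks first keeps pasted logs and stack traces from triggering unrelated labels; only the prose the reporter wrote is matched.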
@@ -43,7 +43,6 @@ jobs:
name: Build CPU Docker for Nightly
runs-on:
group: aws-general-8-plus
if: github.repository == 'huggingface/lerobot'
outputs:
image_tag: ${{ env.DOCKER_IMAGE_NAME_CPU }}
steps:
@@ -52,7 +51,7 @@ jobs:
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -78,7 +77,6 @@ jobs:
name: Build GPU Docker for Nightly
runs-on:
group: aws-general-8-plus
if: github.repository == 'huggingface/lerobot'
outputs:
image_tag: ${{ env.DOCKER_IMAGE_NAME_GPU }}
steps:
@@ -87,7 +85,7 @@ jobs:
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false

@@ -1,39 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This workflow labels pull requests based on the files that were changed.
name: Pull Request Labeler

on:
# Allows labeling pull requests when they are opened or updated
# zizmor: ignore[dangerous-triggers] Needed to label PRs from forks
pull_request_target:
branches:
- main
types: [opened, synchronize, reopened, ready_for_review]

permissions:
contents: read
pull-requests: write

jobs:
triage:
name: Label PR
runs-on: ubuntu-latest
if: github.repository == 'huggingface/lerobot' && !github.event.pull_request.draft
steps:
- uses: actions/labeler@v6
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
sync-labels: true # Removes labels if files are removed from the PR
@@ -43,12 +43,12 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v6
uses: actions/checkout@v4
with:
persist-credentials: false

- name: Set up Python
uses: actions/setup-python@v6
uses: actions/setup-python@v5
with:
python-version: '3.10'

@@ -29,7 +29,6 @@ jobs:
build-and-publish:
name: Build and publish Python distributions
runs-on: ubuntu-latest
if: github.repository == 'huggingface/lerobot'
outputs:
version: ${{ steps.extract_info.outputs.tag_version }}
permissions:
@@ -38,12 +37,12 @@ jobs:

steps:
- name: Checkout code
uses: actions/checkout@v6
uses: actions/checkout@v4
with:
persist-credentials: false

- name: Set up Python
uses: actions/setup-python@v6
uses: actions/setup-python@v5
with:
python-version: '3.10'

@@ -135,7 +134,7 @@ jobs:
env:
MUJOCO_GL: egl
steps:
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -177,3 +176,4 @@ jobs:

# TODO(Steven): Publish draft/pre-release and to test pypi weekly
# TODO(Steven): Separate build and publish job
# TODO(Steven): Tag documentation with the same version as the package

@@ -43,7 +43,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v6 # zizmor: ignore[unpinned-uses]
uses: actions/checkout@v4 # zizmor: ignore[unpinned-uses]
with:
fetch-depth: 0
persist-credentials: false

@@ -45,7 +45,6 @@ jobs:
stale:
name: Close Stale Issues and PRs
runs-on: ubuntu-latest
if: github.repository == 'huggingface/lerobot'
permissions:
actions: write
contents: write # only for delete-branch option

@@ -43,13 +43,12 @@ jobs:
full-tests:
name: Full Unbound Tests
runs-on: ubuntu-latest
if: github.repository == 'huggingface/lerobot'
env:
MUJOCO_GL: egl
HF_HOME: /mnt/cache/.cache/huggingface
HF_LEROBOT_HOME: /mnt/cache/.cache/huggingface/lerobot
steps:
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -78,7 +77,7 @@ jobs:
echo "Dependencies unbound:" && cat pyproject.toml

- name: Install lerobot with all extras
run: uv sync --extra all # TODO(Steven): Make flash-attn optional
run: uv sync --all-extras --no-extra groot # TODO(Steven): Make flash-attn optional

- name: Run pytest (all extras)
run: uv run pytest tests -vv
@@ -101,7 +100,7 @@ jobs:
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
- uses: actions/checkout@v6
- uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false
@@ -162,19 +161,15 @@ jobs:
steps:
- name: Get Docker Hub Token and Delete Image
# zizmor: ignore[template-injection]
env:
DOCKERHUB_LEROBOT_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
DOCKERHUB_LEROBOT_PASSWORD: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
IMAGE_FULL: ${{ needs.build-and-push-docker.outputs.image_tag }}
run: |
IMAGE_NAME=$(echo "$IMAGE_FULL" | cut -d':' -f1)
IMAGE_TAG=$(echo "$IMAGE_FULL" | cut -d':' -f2)
IMAGE_NAME=$(echo "${{ needs.build-and-push-docker.outputs.image_tag }}" | cut -d':' -f1)
IMAGE_TAG=$(echo "${{ needs.build-and-push-docker.outputs.image_tag }}" | cut -d':' -f2)

echo "Attempting to delete image: $IMAGE_NAME:$IMAGE_TAG"

TOKEN=$(curl -s -H "Content-Type: application/json" \
-X POST \
-d "{\"username\": \"$DOCKERHUB_LEROBOT_USERNAME\", \"password\": \"$DOCKERHUB_LEROBOT_PASSWORD\"}" \
-d '{"username": "${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}", "password": "${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}"}' \
https://hub.docker.com/v2/users/login/ | jq -r .token)

if [ "$TOKEN" == "null" ] || [ -z "$TOKEN" ]; then
@@ -185,7 +180,7 @@ jobs:
HTTP_RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: JWT ${TOKEN}" \
-X DELETE \
https://hub.docker.com/v2/repositories/${IMAGE_NAME}/tags/$IMAGE_TAG)
https://hub.docker.com/v2/repositories/${IMAGE_NAME}/tags/${IMAGE_TAG}/)

if [ "$HTTP_RESPONSE" -eq 204 ]; then
echo "Successfully deleted Docker image tag: $IMAGE_NAME:$IMAGE_TAG"

@@ -87,7 +87,7 @@ repos:
# TODO(Steven): Uncomment when ready to use
##### Static Analysis & Typing #####
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.19.1
rev: v1.18.2
hooks:
- id: mypy
args: [--config-file=pyproject.toml]

@@ -52,7 +52,7 @@ decisions when appropriate.

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

@@ -60,7 +60,7 @@ representative at an online or offline event.

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
[feedback@huggingface.co](mailto:feedback@huggingface.co).
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the

@@ -1,83 +1,323 @@
|
||||
# How to contribute to 🤗 LeRobot
|
||||
# How to contribute to 🤗 LeRobot?
|
||||
|
||||
Everyone is welcome to contribute, and we value everybody's contribution. Code is not the only way to help the community. Answering questions, helping others, reaching out, and improving the documentation are immensely valuable.
|
||||
Everyone is welcome to contribute, and we value everybody's contribution. Code
|
||||
is thus not the only way to help the community. Answering questions, helping
|
||||
others, reaching out and improving the documentations are immensely valuable to
|
||||
the community.
|
||||
|
||||
Whichever way you choose to contribute, please be mindful to respect our [code of conduct](./CODE_OF_CONDUCT.md).
|
||||
It also helps us if you spread the word: reference the library from blog posts
|
||||
on the awesome projects it made possible, shout out on Twitter when it has
|
||||
helped you, or simply ⭐️ the repo to say "thank you".
|
||||
|
||||
## Ways to Contribute
|
||||
Whichever way you choose to contribute, please be mindful to respect our
|
||||
[code of conduct](https://github.com/huggingface/lerobot/blob/main/CODE_OF_CONDUCT.md).
|
||||
|
||||
You can contribute in many ways:
|
||||
## You can contribute in so many ways!
|
||||
|
||||
- **Fixing issues:** Resolve bugs or improve existing code.
|
||||
- **New features:** Develop new features.
|
||||
- **Extend:** Implement new models/policies, robots, or simulation environments and upload datasets to the Hugging Face Hub.
|
||||
- **Documentation:** Improve examples, guides, and docstrings.
|
||||
- **Feedback:** Submit tickets related to bugs or desired new features.
|
||||
Some of the ways you can contribute to 🤗 LeRobot:
|
||||
|
||||
If you are unsure where to start, join our [Discord Channel](https://discord.gg/JkrYNdmw).
|
||||
- Fixing outstanding issues with the existing code.
|
||||
- Implementing new models, datasets or simulation environments.
|
||||
- Contributing to the examples or to the documentation.
|
||||
- Submitting issues related to bugs or desired new features.
|
||||
|
||||
## Development Setup
|
||||
Following the guides below, feel free to open issues and PRs and to coordinate your efforts with the community on our [Discord Channel](https://discord.gg/VjFz58wn3R). For specific inquiries, reach out to [Remi Cadene](mailto:remi.cadene@huggingface.co).
|
||||
|
||||
To contribute code, you need to set up a development environment.
|
||||
If you are not sure how to contribute or want to know the next features we working on, look on this project page: [LeRobot TODO](https://github.com/orgs/huggingface/projects/46)
|
||||
|
||||
### 1. Fork and Clone
|
||||
## Submitting a new issue or feature request
|
||||
|
||||
Fork the repository on GitHub, then clone your fork:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/<your-handle>/lerobot.git
|
||||
cd lerobot
|
||||
git remote add upstream https://github.com/huggingface/lerobot.git
|
||||
```
|
||||
|
||||
### 2. Environment Installation
|
||||
|
||||
Please follow our [Installation Guide](./docs/source/installation.mdx) for the environment setup & installation from source.
|
||||
|
||||
## Running Tests & Quality Checks
|
||||
|
||||
### Code Style (Pre-commit)
|
||||
|
||||
Install `pre-commit` hooks to run checks automatically before you commit:
|
||||
|
||||
```bash
|
||||
pre-commit install
|
||||
```
|
||||
|
||||
To run checks manually on all files:
|
||||
|
||||
```bash
|
||||
pre-commit run --all-files
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
We use `pytest`. First, ensure you have test artifacts by installing **git-lfs**:
|
||||
Do your best to follow these guidelines when submitting an issue or a feature
|
||||
request. It will make it easier for us to come back to you quickly and with good
|
||||
feedback.
|
||||
|
||||
### Did you find a bug?
|
||||
|
||||
The 🤗 LeRobot library is robust and reliable thanks to the users who notify us of
|
||||
the problems they encounter. So thank you for reporting an issue.
|
||||
|
||||
First, we would really appreciate it if you could **make sure the bug was not
|
||||
already reported** (use the search bar on GitHub under Issues).
|
||||
|
||||
Did not find it? :( So we can act quickly on it, please follow these steps:
|
||||
|
||||
- Include your **OS type and version**, the versions of **Python** and **PyTorch**.
|
||||
- A short, self-contained, code snippet that allows us to reproduce the bug in
|
||||
less than 30s.
|
||||
- The full traceback if an exception is raised.
|
||||
- Attach any other additional information, like screenshots, you think may help.
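The environment details requested in the list above can be gathered with a few lines of Python. This is a minimal sketch, not an official LeRobot tool: `collect_env_info` is a hypothetical helper name, and PyTorch is reported as "not installed" if it cannot be imported.

```python
import platform


def collect_env_info() -> dict:
    """Gather the OS, Python, and PyTorch versions for a bug report."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        import torch  # optional: only needed to report its version

        info["pytorch"] = torch.__version__
    except ImportError:
        info["pytorch"] = "not installed"
    return info


if __name__ == "__main__":
    # Print one "key: value" line per item, ready to paste into an issue.
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Pasting the printed lines at the top of an issue covers the first bullet; the reproduction snippet and traceback still need to be added by hand.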
|
||||
|
||||
### Do you want a new feature?
|
||||
|
||||
A good feature request addresses the following points:
|
||||
|
||||
1. Motivation first:
|
||||
|
||||
- Is it related to a problem/frustration with the library? If so, please explain
|
||||
why. Providing a code snippet that demonstrates the problem is best.
|
||||
- Is it related to something you would need for a project? We'd love to hear
|
||||
about it!
|
||||
- Is it something you worked on and think could benefit the community?
|
||||
Awesome! Tell us what problem it solved for you.
|
||||
|
||||
2. Write a _paragraph_ describing the feature.
|
||||
3. Provide a **code snippet** that demonstrates its future use.
|
||||
4. In case this is related to a paper, please attach a link.
|
||||
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
|
||||
|
||||
If your issue is well written we're already 80% of the way there by the time you
|
||||
post it.
|
||||
|
||||
## Adding new policies, datasets or environments
|
||||
|
||||
Look at our implementations for [datasets](./src/lerobot/datasets/), [policies](./src/lerobot/policies/),
|
||||
environments ([aloha](https://github.com/huggingface/gym-aloha),
|
||||
[pusht](https://github.com/huggingface/gym-pusht))
|
||||
and follow the same api design.
|
||||
|
||||
When implementing a new dataset loadable with LeRobotDataset follow these steps:
|
||||
|
||||
- Update `available_datasets_per_env` in `lerobot/__init__.py`
|
||||
|
||||
When implementing a new environment (e.g. `gym_aloha`), follow these steps:
|
||||
|
||||
- Update `available_tasks_per_env` and `available_datasets_per_env` in `lerobot/__init__.py`
|
||||
|
||||
When implementing a new policy class (e.g. `DiffusionPolicy`) follow these steps:
|
||||
|
||||
- Update `available_policies` and `available_policies_per_env`, in `lerobot/__init__.py`
|
||||
- Set the required `name` class attribute.
|
||||
- Update variables in `tests/test_available.py` by importing your new Policy class
|
||||
|
||||
## Submitting a pull request (PR)
|
||||
|
||||
Before writing code, we strongly advise you to search through the existing PRs or
|
||||
issues to make sure that nobody is already working on the same thing. If you are
|
||||
unsure, it is always a good idea to open an issue to get some feedback.
|
||||
|
||||
You will need basic `git` proficiency to be able to contribute to
|
||||
🤗 LeRobot. `git` is not the easiest tool to use but it has the greatest
|
||||
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
|
||||
Git](https://git-scm.com/book/en/v2) is a very good reference.
|
||||
|
||||
Follow these steps to start contributing:
|
||||
|
||||
1. Fork the [repository](https://github.com/huggingface/lerobot) by
|
||||
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
|
||||
under your GitHub user account.
|
||||
|
||||
2. Clone your fork to your local disk, and add the base repository as a remote. The following command
|
||||
assumes you have your public SSH key uploaded to GitHub. See the following guide for more
|
||||
[information](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
|
||||
|
||||
```bash
|
||||
git clone git@github.com:<your Github handle>/lerobot.git
|
||||
cd lerobot
|
||||
git remote add upstream https://github.com/huggingface/lerobot.git
|
||||
```
|
||||
|
||||
3. Create a new branch to hold your development changes, and do this for every new PR you work on.
|
||||
|
||||
Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
|
||||
|
||||
```bash
|
||||
git checkout main
|
||||
git fetch upstream
|
||||
git rebase upstream/main
|
||||
```
|
||||
|
||||
Once your `main` branch is synchronized, create a new branch from it:
|
||||
|
||||
```bash
|
||||
git checkout -b a-descriptive-name-for-my-changes
|
||||
```
|
||||
|
||||
🚨 **Do not** work on the `main` branch.
|
||||
|
||||
4. For development, we advise using a tool like `poetry` or `uv` instead of just `pip` to easily track our dependencies.
|
||||
Follow the instructions to [install poetry](https://python-poetry.org/docs/#installation) (use a version >=2.1.0) or to [install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods) if you don't have one of them already.
|
||||
|
||||
Set up a development environment with conda:
|
||||
|
||||
```bash
|
||||
conda create -y -n lerobot-dev python=3.10 && conda activate lerobot-dev
|
||||
```
|
||||
|
||||
If you're using `uv`, it can manage python versions so you can instead do:
|
||||
|
||||
```bash
|
||||
uv venv --python 3.10 && source .venv/bin/activate
|
||||
```
|
||||
|
||||
To develop on 🤗 LeRobot, you will need to install at least the `dev` and `test` extras along with the core library:
|
||||
|
||||
using `poetry`
|
||||
|
||||
```bash
|
||||
poetry sync --extras "dev test"
|
||||
```
|
||||
|
||||
using `uv`
|
||||
|
||||
```bash
|
||||
uv sync --extra dev --extra test
|
||||
```
|
||||
|
||||
You can also install the project with all its dependencies (including environments):
|
||||
|
||||
using `poetry`
|
||||
|
||||
```bash
|
||||
poetry sync --all-extras
|
||||
```
|
||||
|
||||
using `uv`
|
||||
|
||||
```bash
|
||||
uv sync --all-extras
|
||||
```
|
||||
|
||||
> **Note:** If you don't install simulation environments with `--all-extras`, the tests that require them will be skipped when running the pytest suite locally. However, they _will_ be tested in the CI. In general, we advise you to install everything and test locally before pushing.
|
||||
|
||||
Whichever command you chose to install the project (e.g. `poetry sync --all-extras`), you should run it again when pulling code with an updated version of `pyproject.toml` and `poetry.lock` in order to synchronize your virtual environment with the new dependencies.
|
||||
|
||||
The equivalent of `pip install some-package` would be:
|
||||
|
||||
using `poetry`
|
||||
|
||||
```bash
|
||||
poetry add some-package
|
||||
```
|
||||
|
||||
using `uv`
|
||||
|
||||
```bash
|
||||
uv add some-package
|
||||
```
|
||||
|
||||
When making changes to the poetry sections of the `pyproject.toml`, you should run the following command to lock dependencies.
|
||||
using `poetry`
|
||||
|
||||
```bash
|
||||
poetry lock
|
||||
```
|
||||
|
||||
using `uv`
|
||||
|
||||
```bash
|
||||
uv lock
|
||||
```
|
||||
|
||||
5. Develop the features on your branch.
|
||||
|
||||
As you work on the features, you should make sure that the test suite
|
||||
passes. You should run the tests impacted by your changes like this (see
|
||||
below an explanation regarding the environment variable):
|
||||
|
||||
```bash
|
||||
pytest tests/<TEST_TO_RUN>.py
|
||||
```
|
||||
|
||||
6. Follow our style.
|
||||
|
||||
`lerobot` relies on `ruff` to format its source code
|
||||
consistently. Set up [`pre-commit`](https://pre-commit.com/) to run these checks
|
||||
automatically as Git commit hooks.
|
||||
|
||||
Install `pre-commit` hooks:
|
||||
|
||||
```bash
|
||||
pre-commit install
|
||||
```
|
||||
|
||||
You can run these hooks whenever you need on staged files with:
|
||||
|
||||
```bash
|
||||
pre-commit
|
||||
```
|
||||
|
||||
Once you're happy with your changes, add changed files using `git add` and
|
||||
make a commit with `git commit` to record your changes locally:
|
||||
|
||||
```bash
|
||||
git add modified_file.py
|
||||
git commit
|
||||
```
|
||||
|
||||
Note: if you have already committed changes with incorrect formatting, you can use:
|
||||
|
||||
```bash
|
||||
pre-commit run --all-files
|
||||
```
|
||||
|
||||
Please write [good commit messages](https://chris.beams.io/posts/git-commit/).
|
||||
|
||||
It is a good idea to sync your copy of the code with the original
|
||||
repository regularly. This way you can quickly account for changes:
|
||||
|
||||
```bash
|
||||
git fetch upstream
|
||||
git rebase upstream/main
|
||||
```
|
||||
|
||||
Push the changes to your account using:
|
||||
|
||||
```bash
|
||||
git push -u origin a-descriptive-name-for-my-changes
|
||||
```
|
||||
|
||||
7. Once you are satisfied (**and the checklist below is happy too**), go to the
|
||||
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
|
||||
to the project maintainers for review.
|
||||
|
||||
8. It's ok if maintainers ask you for changes. It happens to core contributors
|
||||
too! So everyone can see the changes in the Pull request, work in your local
|
||||
branch and push the changes to your fork. They will automatically appear in
|
||||
the pull request.
|
||||
|
||||
### Checklist
|
||||
|
||||
1. The title of your pull request should be a summary of its contribution;
|
||||
2. If your pull request addresses an issue, please mention the issue number in
|
||||
the pull request description to make sure they are linked (and people
|
||||
consulting the issue know you are working on it);
|
||||
3. To indicate a work in progress please prefix the title with `[WIP]`, or preferably mark
|
||||
the PR as a draft PR. These are useful to avoid duplicated work, and to differentiate
|
||||
it from PRs ready to be merged;
|
||||
4. Make sure existing tests pass;
|
||||
|
||||
### Tests
|
||||
|
||||
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/lerobot/tree/main/tests).
|
||||
|
||||
Install [git lfs](https://git-lfs.com/) to retrieve test artifacts (if you don't have it already).
|
||||
|
||||
On Mac:
|
||||
|
||||
```bash
|
||||
brew install git-lfs
|
||||
git lfs install
|
||||
```
|
||||
|
||||
On Ubuntu:
|
||||
|
||||
```bash
|
||||
sudo apt-get install git-lfs
|
||||
git lfs install
|
||||
```
|
||||
|
||||
Pull artifacts if they're not in [tests/artifacts](tests/artifacts)
|
||||
|
||||
```bash
|
||||
git lfs pull
|
||||
```
|
||||
|
||||
Run the full suite (this may require extras installed):
|
||||
We use `pytest` in order to run the tests. From the root of the
|
||||
repository, here's how to run tests with `pytest` for the library:
|
||||
|
||||
```bash
|
||||
pytest -sv ./tests
|
||||
python -m pytest -sv ./tests
|
||||
```
|
||||
|
||||
Or run a specific test file during development:
|
||||
|
||||
```bash
|
||||
pytest -sv tests/test_specific_feature.py
|
||||
```
|
||||
|
||||
## Submitting Issues & Pull Requests
|
||||
|
||||
Use the templates for required fields and examples.
|
||||
|
||||
- **Issues:** Follow the [ticket template](./.github/ISSUE_TEMPLATE/bug-report.yml).
|
||||
- **Pull requests:** Rebase on `upstream/main`, use a descriptive branch (don't work on `main`), run `pre-commit` and tests locally, and follow the [PR template](./.github/PULL_REQUEST_TEMPLATE.md).
|
||||
|
||||
One member of the LeRobot team will then review your contribution.
|
||||
|
||||
Thank you for contributing to LeRobot!
|
||||
You can specify a smaller set of tests in order to test only the feature
|
||||
you're working on.
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
<p align="center">
|
||||
<img alt="LeRobot, Hugging Face Robotics Library" src="./media/readme/lerobot-logo-thumbnail.png" width="100%">
|
||||
<img alt="LeRobot, Hugging Face Robotics Library" src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/lerobot-logo-thumbnail.png" width="100%">
|
||||
<br/>
|
||||
<br/>
|
||||
</p>
|
||||
|
||||
<div align="center">
|
||||
@@ -10,131 +12,323 @@
|
||||
[](https://pypi.org/project/lerobot/)
|
||||
[](https://pypi.org/project/lerobot/)
|
||||
[](https://github.com/huggingface/lerobot/blob/main/CODE_OF_CONDUCT.md)
|
||||
[](https://discord.gg/q8Dzzpym3f)
|
||||
[](https://discord.gg/s3KuuzsPFb)
|
||||
|
||||
<!-- [](https://codecov.io/gh/huggingface/lerobot) -->
|
||||
|
||||
</div>
|
||||
|
||||
**LeRobot** aims to provide models, datasets, and tools for real-world robotics in PyTorch. The goal is to lower the barrier to entry so that everyone can contribute to and benefit from shared datasets and pretrained models.
|
||||
<h2 align="center">
|
||||
<p><a href="https://huggingface.co/docs/lerobot/hope_jr">
|
||||
Build Your Own HopeJR Robot!</a></p>
|
||||
</h2>
|
||||
|
||||
🤗 A hardware-agnostic, Python-native interface that standardizes control across diverse platforms, from low-cost arms (SO-100) to humanoids.
|
||||
<div align="center">
|
||||
<img
|
||||
src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/hope_jr/hopejr.png"
|
||||
alt="HopeJR robot"
|
||||
title="HopeJR robot"
|
||||
width="60%"
|
||||
/>
|
||||
|
||||
🤗 A standardized, scalable LeRobotDataset format (Parquet + MP4 or images) hosted on the Hugging Face Hub, enabling efficient storage, streaming and visualization of massive robotic datasets.
|
||||
<p><strong>Meet HopeJR – A humanoid robot arm and hand for dexterous manipulation!</strong></p>
|
||||
<p>Control it with exoskeletons and gloves for precise hand movements.</p>
|
||||
<p>Perfect for advanced manipulation tasks! 🤖</p>
|
||||
|
||||
🤗 State-of-the-art policies that have been shown to transfer to the real world, ready for training and deployment.
|
||||
<p><a href="https://huggingface.co/docs/lerobot/hope_jr">
|
||||
See the full HopeJR tutorial here.</a></p>
|
||||
</div>
|
||||
|
||||
🤗 Comprehensive support for the open-source ecosystem to democratize physical AI.
|
||||
<br/>
|
||||
|
||||
## Quick Start
|
||||
<h2 align="center">
|
||||
<p><a href="https://huggingface.co/docs/lerobot/so101">
|
||||
Build Your Own SO-101 Robot!</a></p>
|
||||
</h2>
|
||||
|
||||
LeRobot can be installed directly from PyPI.
|
||||
<div align="center">
|
||||
<table>
|
||||
<tr>
|
||||
<td align="center"><img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/so101/so101.webp" alt="SO-101 follower arm" title="SO-101 follower arm" width="90%"/></td>
|
||||
<td align="center"><img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/so101/so101-leader.webp" alt="SO-101 leader arm" title="SO-101 leader arm" width="90%"/></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<p><strong>Meet the updated SO100, the SO-101 – Just €114 per arm!</strong></p>
|
||||
<p>Train it in minutes with a few simple moves on your laptop.</p>
|
||||
<p>Then sit back and watch your creation act autonomously! 🤯</p>
|
||||
|
||||
<p><a href="https://huggingface.co/docs/lerobot/so101">
|
||||
See the full SO-101 tutorial here.</a></p>
|
||||
|
||||
<p>Want to take it to the next level? Make your SO-101 mobile by building LeKiwi!</p>
|
||||
<p>Check out the <a href="https://huggingface.co/docs/lerobot/lekiwi">LeKiwi tutorial</a> and bring your robot to life on wheels.</p>
|
||||
|
||||
<img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/lekiwi/kiwi.webp" alt="LeKiwi mobile robot" title="LeKiwi mobile robot" width="50%">
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
<h3 align="center">
|
||||
<p>LeRobot: State-of-the-art AI for real-world robotics</p>
|
||||
</h3>
|
||||
|
||||
---
|
||||
|
||||
🤗 LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. The goal is to lower the barrier to entry to robotics so that everyone can contribute and benefit from sharing datasets and pretrained models.
|
||||
|
||||
🤗 LeRobot contains state-of-the-art approaches that have been shown to transfer to the real world, with a focus on imitation learning and reinforcement learning.
|
||||
|
||||
🤗 LeRobot already provides a set of pretrained models, datasets with human collected demonstrations, and simulation environments to get started without assembling a robot. In the coming weeks, the plan is to add more and more support for real-world robotics on the most affordable and capable robots out there.
|
||||
|
||||
🤗 LeRobot hosts pretrained models and datasets on this Hugging Face community page: [huggingface.co/lerobot](https://huggingface.co/lerobot)
|
||||
|
||||
#### Examples of pretrained models on simulation environments
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td><img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/gym/aloha_act.gif" width="100%" alt="ACT policy on ALOHA env"/></td>
|
||||
<td><img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/gym/simxarm_tdmpc.gif" width="100%" alt="TDMPC policy on SimXArm env"/></td>
|
||||
<td><img src="https://raw.githubusercontent.com/huggingface/lerobot/main/media/gym/pusht_diffusion.gif" width="100%" alt="Diffusion policy on PushT env"/></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="center">ACT policy on ALOHA env</td>
|
||||
<td align="center">TDMPC policy on SimXArm env</td>
|
||||
<td align="center">Diffusion policy on PushT env</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## Installation
|
||||
|
||||
LeRobot works with Python 3.10+ and PyTorch 2.2+.
|
||||
|
||||
### Environment Setup
|
||||
|
||||
Create a virtual environment with Python 3.10 and activate it, e.g. with [`miniforge`](https://conda-forge.org/download/):
|
||||
|
||||
```bash
|
||||
conda create -y -n lerobot python=3.10
|
||||
conda activate lerobot
|
||||
```
|
||||
|
||||
When using `conda`, install `ffmpeg` in your environment:
|
||||
|
||||
```bash
|
||||
conda install ffmpeg -c conda-forge
|
||||
```
|
||||
|
||||
> **NOTE:** This usually installs `ffmpeg 7.X` for your platform compiled with the `libsvtav1` encoder. If `libsvtav1` is not supported (check supported encoders with `ffmpeg -encoders`), you can:
|
||||
>
|
||||
> - _[On any platform]_ Explicitly install `ffmpeg 7.X` using:
|
||||
>
|
||||
> ```bash
|
||||
> conda install ffmpeg=7.1.1 -c conda-forge
|
||||
> ```
|
||||
>
|
||||
> - _[On Linux only]_ Install [ffmpeg build dependencies](https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu#GettheDependencies) and [compile ffmpeg from source with libsvtav1](https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu#libsvtav1), then make sure you are using the ffmpeg binary that corresponds to your install (check with `which ffmpeg`).
|
||||
|
||||
### Install LeRobot 🤗
|
||||
|
||||
#### From Source
|
||||
|
||||
First, clone the repository and navigate into the directory:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/lerobot.git
|
||||
cd lerobot
|
||||
```
|
||||
|
||||
Then, install the library in editable mode. This is useful if you plan to contribute to the code.
|
||||
|
||||
```bash
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
> **NOTE:** If you encounter build errors, you may need to install additional dependencies (`cmake`, `build-essential`, and `ffmpeg libs`). On Linux, run:
|
||||
> `sudo apt-get install cmake build-essential python3-dev pkg-config libavformat-dev libavcodec-dev libavdevice-dev libavutil-dev libswscale-dev libswresample-dev libavfilter-dev`. For other systems, see: [Compiling PyAV](https://pyav.org/docs/develop/overview/installation.html#bring-your-own-ffmpeg)
|
||||
|
||||
For simulations, 🤗 LeRobot comes with gymnasium environments that can be installed as extras:
|
||||
|
||||
- [aloha](https://github.com/huggingface/gym-aloha)
|
||||
- [xarm](https://github.com/huggingface/gym-xarm)
|
||||
- [pusht](https://github.com/huggingface/gym-pusht)
|
||||
|
||||
For instance, to install 🤗 LeRobot with aloha and pusht, use:
|
||||
|
||||
```bash
|
||||
pip install -e ".[aloha, pusht]"
|
||||
```
|
||||
|
||||
### Installation from PyPI
|
||||
|
||||
**Core Library:**
|
||||
Install the base package with:
|
||||
|
||||
```bash
|
||||
pip install lerobot
|
||||
lerobot-info
|
||||
```
|
||||
|
||||
> [!IMPORTANT]
|
||||
> For a detailed installation guide, please see the [Installation Documentation](https://huggingface.co/docs/lerobot/installation).
|
||||
_This installs only the default dependencies._
|
||||
|
||||
## Robots & Control
|
||||
|
||||
<div align="center">
|
||||
<img src="./media/readme/robots_control_video.webp" width="640px" alt="Reachy 2 Demo">
|
||||
</div>
|
||||
|
||||
LeRobot provides a unified `Robot` class interface that decouples control logic from hardware specifics. It supports a wide range of robots and teleoperation devices.
|
||||
|
||||
```python
|
||||
from lerobot.robots.myrobot import MyRobot
|
||||
|
||||
# Connect to a robot
|
||||
robot = MyRobot(config=...)
|
||||
robot.connect()
|
||||
|
||||
# Read observation and send action
|
||||
obs = robot.get_observation()
|
||||
action = model.select_action(obs)
|
||||
robot.send_action(action)
|
||||
```
|
||||
|
||||
**Supported Hardware:** SO100, LeKiwi, Koch, HopeJR, OMX, EarthRover, Reachy2, Gamepads, Keyboards, Phones, OpenARM, Unitree G1.
|
||||
|
||||
While these devices are natively integrated into the LeRobot codebase, the library is designed to be extensible. You can easily implement the Robot interface to utilize LeRobot's data collection, training, and visualization tools for your own custom robot.
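As a rough illustration of that extension point, the sketch below mimics the three calls used in the earlier snippet (`connect`, `get_observation`, `send_action`) with a stand-in class. It is a toy under stated assumptions: `DummyRobot` and `DummyPolicy` are hypothetical names, the class does not subclass LeRobot's actual `Robot` base class, and the real interface may require additional methods and a config object.

```python
class DummyRobot:
    """Hypothetical stand-in exposing the calls used in the example above."""

    def __init__(self, num_joints: int = 6):
        self.num_joints = num_joints
        self.is_connected = False
        self.joint_positions = [0.0] * num_joints

    def connect(self) -> None:
        # A real robot would open its serial bus or network connection here.
        self.is_connected = True

    def get_observation(self) -> dict:
        if not self.is_connected:
            raise RuntimeError("Call connect() before reading observations.")
        return {"state": list(self.joint_positions)}

    def send_action(self, action: list) -> None:
        if len(action) != self.num_joints:
            raise ValueError("Action length must match the number of joints.")
        # A real robot would forward the goal positions to its motors.
        self.joint_positions = list(action)


class DummyPolicy:
    """Hypothetical policy that nudges every joint by a fixed step."""

    def select_action(self, observation: dict) -> list:
        return [position + 0.1 for position in observation["state"]]


# Same observe -> act -> send loop as above, run for three steps.
robot = DummyRobot(num_joints=2)
robot.connect()
policy = DummyPolicy()
for _ in range(3):
    obs = robot.get_observation()
    robot.send_action(policy.select_action(obs))
```

Keeping the hardware-specific work inside `connect` and `send_action` is what lets the control loop stay identical across devices.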
|
||||
|
||||
For detailed hardware setup guides, see the [Hardware Documentation](https://huggingface.co/docs/lerobot/integrate_hardware).
|
||||
|
||||
## LeRobot Dataset
|
||||
|
||||
To solve the data fragmentation problem in robotics, we utilize the **LeRobotDataset** format.
|
||||
|
||||
- **Structure:** Synchronized MP4 videos (or images) for vision and Parquet files for state/action data.
|
||||
- **HF Hub Integration:** Explore thousands of robotics datasets on the [Hugging Face Hub](https://huggingface.co/lerobot).
|
||||
- **Tools:** Seamlessly delete episodes, split by indices/fractions, add/remove features, and merge multiple datasets.
|
||||
|
||||
```python
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
# Load a dataset from the Hub
|
||||
dataset = LeRobotDataset("lerobot/aloha_mobile_cabinet")
|
||||
|
||||
# Access data (automatically handles video decoding)
|
||||
frame_index = 0  # indexing the dataset returns a single frame, not an episode
|
||||
print(f"{dataset[frame_index]['action'].shape=}\n")
|
||||
```
|
||||
|
||||
Learn more about it in the [LeRobotDataset Documentation](https://huggingface.co/docs/lerobot/lerobot-dataset-v3).
|
||||
|
||||
## SoTA Models
|
||||
|
||||
LeRobot implements state-of-the-art policies in pure PyTorch, covering Imitation Learning, Reinforcement Learning, and Vision-Language-Action (VLA) models, with more coming soon. It also provides you with the tools to instrument and inspect your training process.
|
||||
|
||||
<p align="center">
|
||||
<img alt="Gr00t Architecture" src="./media/readme/VLA_architecture.jpg" width="640px">
|
||||
</p>
|
||||
|
||||
Training a policy is as simple as running a script configuration:
|
||||
**Extra Features:**
|
||||
To install additional functionality, use one of the following:
|
||||
|
||||
```bash
|
||||
lerobot-train \
|
||||
--policy=act \
|
||||
--dataset.repo_id=lerobot/aloha_mobile_cabinet
|
||||
pip install 'lerobot[all]' # All available features
|
||||
pip install 'lerobot[aloha,pusht]' # Specific features (Aloha & Pusht)
|
||||
pip install 'lerobot[feetech]' # Feetech motor support
|
||||
```
|
||||
|
||||
| Category | Models |
|
||||
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
|
||||
| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
|
||||
| **VLA Models**             | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
|
||||
_Replace `[...]` with your desired features._
|
||||
|
||||
As with hardware, you can implement your own policy, leverage LeRobot's data collection, training, and visualization tools, and share your model on the HF Hub.
|
||||
**Available Tags:**
|
||||
For a full list of optional dependencies, see:
|
||||
https://pypi.org/project/lerobot/
|
||||
|
||||
For detailed policy setup guides, see the [Policy Documentation](https://huggingface.co/docs/lerobot/bring_your_own_policies).
|
||||
> [!NOTE]
|
||||
> For lerobot 0.4.0, if you want to install pi tags, you will have to do: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
|
||||
>
|
||||
> This will be solved in the next patch release
|
||||
|
||||
## Inference & Evaluation
|
||||
### Weights & Biases
|
||||
|
||||
Evaluate your policies in simulation or on real hardware using the unified evaluation script. LeRobot supports standard benchmarks like **LIBERO**, **MetaWorld** and more to come.
|
||||
To use [Weights and Biases](https://docs.wandb.ai/quickstart) for experiment tracking, log in with
|
||||
|
||||
```bash
|
||||
# Evaluate a policy on the LIBERO benchmark
|
||||
lerobot-eval \
|
||||
--policy.path=lerobot/pi0_libero_finetuned \
|
||||
--env.type=libero \
|
||||
--env.task=libero_object \
|
||||
--eval.n_episodes=10
|
||||
wandb login
|
||||
```
|
||||
|
||||
Learn how to implement your own simulation environment or benchmark and distribute it from the HF Hub by following the [EnvHub Documentation](https://huggingface.co/docs/lerobot/envhub).
|
||||
(note: you will also need to enable WandB in the configuration. See below.)
|
||||
|
||||
## Resources
|
||||
### Visualize datasets
|
||||
|
||||
- **[Documentation](https://huggingface.co/docs/lerobot/index):** The complete guide to tutorials & API.
|
||||
- **[Discord](https://discord.gg/q8Dzzpym3f):** Join the `LeRobot` server to discuss with the community.
|
||||
- **[X](https://x.com/LeRobotHF):** Follow us on X to stay up-to-date with the latest developments.
|
||||
- **[Robot Learning Tutorial](https://huggingface.co/spaces/lerobot/robot-learning-tutorial):** A free, hands-on course to learn robot learning using LeRobot.
|
||||
Check out [example 1](https://github.com/huggingface/lerobot/blob/main/examples/dataset/load_lerobot_dataset.py) that illustrates how to use our dataset class which automatically downloads data from the Hugging Face hub.
|
||||
|
||||
You can also locally visualize episodes from a dataset on the hub by executing our script from the command line:
|
||||
|
||||
```bash
|
||||
lerobot-dataset-viz \
|
||||
--repo-id lerobot/pusht \
|
||||
--episode-index 0
|
||||
```
|
||||
|
||||
or from a dataset in a local folder with the `--root` option and `--mode local` (in the following case, the dataset will be searched for in `./my_local_data_dir/lerobot/pusht`):
|
||||
|
||||
```bash
|
||||
lerobot-dataset-viz \
|
||||
--repo-id lerobot/pusht \
|
||||
--root ./my_local_data_dir \
|
||||
--mode local \
|
||||
--episode-index 0
|
||||
```
|
||||
|
||||
It will open `rerun.io` and display the camera streams, robot states and actions, like this:
|
||||
|
||||
https://github-production-user-asset-6210df.s3.amazonaws.com/4681518/328035972-fd46b787-b532-47e2-bb6f-fd536a55a7ed.mov?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240505%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240505T172924Z&X-Amz-Expires=300&X-Amz-Signature=d680b26c532eeaf80740f08af3320d22ad0b8a4e4da1bcc4f33142c15b509eda&X-Amz-SignedHeaders=host&actor_id=24889239&key_id=0&repo_id=748713144
|
||||
|
||||
Our script can also visualize datasets stored on a distant server. See `lerobot-dataset-viz --help` for more instructions.
|
||||
|
||||
### The `LeRobotDataset` format
|
||||
|
||||
A dataset in `LeRobotDataset` format is very simple to use. It can be loaded from a repository on the Hugging Face hub or a local folder simply with e.g. `dataset = LeRobotDataset("lerobot/aloha_static_coffee")` and can be indexed into like any Hugging Face and PyTorch dataset. For instance `dataset[0]` will retrieve a single temporal frame from the dataset containing observation(s) and an action as PyTorch tensors ready to be fed to a model.
|
||||
|
||||
A specificity of `LeRobotDataset` is that, rather than retrieving a single frame by its index, we can retrieve several frames based on their temporal relationship with the indexed frame, by setting `delta_timestamps` to a list of relative times with respect to the indexed frame. For example, with `delta_timestamps = {"observation.image": [-1, -0.5, -0.2, 0]}` one can retrieve, for a given index, 4 frames: 3 "previous" frames 1 second, 0.5 seconds, and 0.2 seconds before the indexed frame, and the indexed frame itself (corresponding to the 0 entry). See example [1_load_lerobot_dataset.py](https://github.com/huggingface/lerobot/blob/main/examples/dataset/load_lerobot_dataset.py) for more details on `delta_timestamps`.
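To make the arithmetic concrete, the sketch below converts relative times into frame indices for a given frame rate. It is only an illustration: `delta_to_frame_indices` is a hypothetical helper, the 50 fps value is assumed, and clamping to the episode bounds is one possible policy, not necessarily what `LeRobotDataset` does internally.

```python
def delta_to_frame_indices(index, deltas, fps, episode_start, episode_end):
    """Map relative times (in seconds) to frame indices inside one episode.

    `episode_end` is exclusive. Out-of-range indices are clamped to the
    episode bounds (an assumption made for this sketch).
    """
    indices = []
    for delta in deltas:
        offset = round(delta * fps)  # seconds -> number of frames
        target = index + offset
        target = max(episode_start, min(target, episode_end - 1))
        indices.append(target)
    return indices


# Frame 200 of an episode spanning dataset indices [100, 400), at an assumed 50 fps.
print(delta_to_frame_indices(200, [-1, -0.5, -0.2, 0], 50, 100, 400))
# -> [150, 175, 190, 200]
```

Near the start of an episode the earlier offsets would fall outside it; in this sketch they collapse onto the first frame of the episode.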
|
||||
|
||||
Under the hood, the `LeRobotDataset` format makes use of several ways to serialize data, which can be useful to understand if you plan to work more closely with this format. We tried to make a flexible yet simple dataset format that would cover most types of features and specificities present in reinforcement learning and robotics, in simulation and in the real world, with a focus on cameras and robot states but easily extended to other types of sensory inputs as long as they can be represented by a tensor.
|
||||
|
||||
Here are the important details and internal structure organization of a typical `LeRobotDataset` instantiated with `dataset = LeRobotDataset("lerobot/aloha_static_coffee")`. The exact features will change from dataset to dataset but not the main aspects:
|
||||
|
||||
```
|
||||
dataset attributes:
|
||||
├ hf_dataset: a Hugging Face dataset (backed by Arrow/parquet). Typical features example:
|
||||
│ ├ observation.images.cam_high (VideoFrame):
|
||||
│ │ VideoFrame = {'path': path to a mp4 video, 'timestamp' (float32): timestamp in the video}
|
||||
│ ├ observation.state (list of float32): position of an arm joints (for instance)
|
||||
│ ... (more observations)
|
||||
│ ├ action (list of float32): goal position of an arm joints (for instance)
|
||||
│ ├ episode_index (int64): index of the episode for this sample
|
||||
│ ├ frame_index (int64): index of the frame for this sample in the episode ; starts at 0 for each episode
|
||||
│ ├ timestamp (float32): timestamp in the episode
|
||||
│ ├ next.done (bool): indicates the end of an episode ; True for the last frame in each episode
|
||||
│ └ index (int64): general index in the whole dataset
|
||||
├ meta: a LeRobotDatasetMetadata object containing:
|
||||
│ ├ info: a dictionary of metadata on the dataset
|
||||
│ │ ├ codebase_version (str): this is to keep track of the codebase version the dataset was created with
|
||||
│ │ ├ fps (int): frames per second the dataset is recorded/synchronized to
|
||||
│ │ ├ features (dict): all features contained in the dataset with their shapes and types
|
||||
│ │ ├ total_episodes (int): total number of episodes in the dataset
|
||||
│ │ ├ total_frames (int): total number of frames in the dataset
|
||||
│ │ ├ robot_type (str): robot type used for recording
|
||||
│ │ ├ data_path (str): formattable string for the parquet files
|
||||
│ │ └ video_path (str): formattable string for the video files (if using videos)
|
||||
│ ├ episodes: a DataFrame containing episode metadata with columns:
|
||||
│ │ ├ episode_index (int): index of the episode
|
||||
│ │ ├ tasks (list): list of tasks for this episode
|
||||
│ │ ├ length (int): number of frames in this episode
|
||||
│ │ ├ dataset_from_index (int): start index of this episode in the dataset
|
||||
│ │ └ dataset_to_index (int): end index of this episode in the dataset
|
||||
│ ├ stats: a dictionary of statistics (max, mean, min, std) for each feature in the dataset, for instance
|
||||
│ │ ├ observation.images.front_cam: {'max': tensor with same number of dimensions (e.g. `(c, 1, 1)` for images, `(c,)` for states), etc.}
|
||||
│ │ └ ...
|
||||
│ └ tasks: a DataFrame containing task information with task names as index and task_index as values
|
||||
├ root (Path): local directory where the dataset is stored
|
||||
├ image_transforms (Callable): optional image transformations to apply to visual modalities
|
||||
└ delta_timestamps (dict): optional delta timestamps for temporal queries
|
||||
```
|
||||
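The `dataset_from_index` and `dataset_to_index` columns listed above let you map a global `index` back to an episode and a per-episode `frame_index`. The sketch below shows one way to do that on plain Python data. The sample rows and the `locate_frame` helper are hypothetical, and it assumes `dataset_to_index` is an exclusive end bound, which the listing above does not state.

```python
from bisect import bisect_right

# Hypothetical episode metadata, shaped like the `meta.episodes` columns.
episodes = [
    {"episode_index": 0, "length": 3, "dataset_from_index": 0, "dataset_to_index": 3},
    {"episode_index": 1, "length": 2, "dataset_from_index": 3, "dataset_to_index": 5},
]


def locate_frame(index, episodes):
    """Return (episode_index, frame_index) for a global dataset index.

    Assumes episodes are sorted by start index and that
    `dataset_to_index` is exclusive (an assumption, not confirmed above).
    """
    starts = [ep["dataset_from_index"] for ep in episodes]
    pos = bisect_right(starts, index) - 1  # last episode starting at or before index
    if pos < 0 or index >= episodes[pos]["dataset_to_index"]:
        raise IndexError(f"index {index} is outside the dataset")
    ep = episodes[pos]
    return ep["episode_index"], index - ep["dataset_from_index"]


print(locate_frame(4, episodes))  # (1, 1)
```

If the end bound turns out to be inclusive in your dataset version, change the `>=` comparison to `>`.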
|
||||
A `LeRobotDataset` is serialised using several widespread file formats for each of its parts, namely:
|
||||
|
||||
- `hf_dataset` is stored as parquet using the Hugging Face `datasets` library serialization
|
||||
- videos are stored in mp4 format to save space
|
||||
- metadata are stored in plain json/jsonl files
|
||||
|
||||
Datasets can be uploaded to and downloaded from the Hugging Face Hub seamlessly. To work on a local dataset, you can specify its location with the `root` argument if it's not in the default `~/.cache/huggingface/lerobot` location.
|
||||
|
||||
#### Reproduce state-of-the-art (SOTA)
|
||||
|
||||
We provide some pretrained policies on our [hub page](https://huggingface.co/lerobot) that can achieve state-of-the-art performance.
|
||||
You can reproduce their training by loading the config from their run. Simply running:
|
||||
|
||||
```bash
|
||||
lerobot-train --config_path=lerobot/diffusion_pusht
|
||||
```
|
||||
|
||||
reproduces SOTA results for Diffusion Policy on the PushT task.
|
||||
|
||||
## Contribute
|
||||
|
||||
If you would like to contribute to 🤗 LeRobot, please check out our [contribution guide](https://github.com/huggingface/lerobot/blob/main/CONTRIBUTING.md).
|
||||
|
||||
### Add a pretrained policy
|
||||
|
||||
Once you have trained a policy you may upload it to the Hugging Face hub using a hub id that looks like `${hf_user}/${repo_name}` (e.g. [lerobot/diffusion_pusht](https://huggingface.co/lerobot/diffusion_pusht)).
|
||||
|
||||
You first need to find the checkpoint folder located inside your experiment directory (e.g. `outputs/train/2024-05-05/20-21-12_aloha_act_default/checkpoints/002500`). Within that there is a `pretrained_model` directory which should contain:
|
||||
|
||||
- `config.json`: A serialized version of the policy configuration (following the policy's dataclass config).
|
||||
- `model.safetensors`: A set of `torch.nn.Module` parameters, saved in [Hugging Face Safetensors](https://huggingface.co/docs/safetensors/index) format.
|
||||
- `train_config.json`: A consolidated configuration containing all parameters used for training. The policy configuration should match `config.json` exactly. This is useful for anyone who wants to evaluate your policy or for reproducibility.
|
||||
|
||||
To upload these to the hub, run the following:
|
||||
|
||||
```bash
|
||||
huggingface-cli upload ${hf_user}/${repo_name} path/to/pretrained_model
|
||||
```
|
||||
|
||||
See [lerobot_eval.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_eval.py) for an example of how other people may use your policy.
|
||||
|
||||
### Acknowledgment
|
||||
|
||||
- The LeRobot team 🤗 for building SmolVLA [Paper](https://arxiv.org/abs/2506.01844), [Blog](https://huggingface.co/blog/smolvla).
|
||||
- Thanks to Tony Zhao, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
|
||||
- Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
|
||||
- Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
|
||||
- Thanks to Antonio Loquercio and Ashish Kumar for their early support.
|
||||
- Thanks to [Seungjae (Jay) Lee](https://sjlee.cc/), [Mahi Shafiullah](https://mahis.life/) and colleagues for open sourcing [VQ-BeT](https://sjlee.cc/vq-bet/) policy and helping us adapt the codebase to our repository. The policy is adapted from [VQ-BeT repo](https://github.com/jayLEE0301/vq_bet_official).
|
||||
|
||||
## Citation
|
||||
|
||||
If you use LeRobot in your research, please cite:
|
||||
If you want, you can cite this work with:
|
||||
|
||||
```bibtex
|
||||
@misc{cadene2024lerobot,
|
||||
@@ -145,14 +339,6 @@ If you use LeRobot in your research, please cite:
|
||||
}
|
||||
```
|
||||
|
||||
## Contribute
|
||||
## Star History
|
||||
|
||||
We welcome contributions from everyone in the community! To get started, please read our [CONTRIBUTING.md](./CONTRIBUTING.md) guide. Whether you're adding a new feature, improving documentation, or fixing a bug, your help and feedback are invaluable. We're incredibly excited about the future of open-source robotics and can't wait to work with you on what's next—thank you for your support!
|
||||
|
||||
<p align="center">
|
||||
<img alt="SO101 Video" src="./media/readme/so100_video.webp" width="640px">
|
||||
</p>
|
||||
|
||||
<div align="center">
|
||||
<sub>Built by the <a href="https://huggingface.co/lerobot">LeRobot</a> team at <a href="https://huggingface.co">Hugging Face</a> with ❤️</sub>
|
||||
</div>
|
||||
[](https://star-history.com/#huggingface/lerobot&Timeline)
|
||||
|
||||
|
@@ -73,7 +73,7 @@ ENV HOME=/home/user_lerobot \
|
||||
RUN uv venv --python python${PYTHON_VERSION}
|
||||
|
||||
# Install Python dependencies for caching
|
||||
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml README.md MANIFEST.in ./
|
||||
COPY --chown=user_lerobot:user_lerobot pyproject.toml README.md MANIFEST.in ./
|
||||
COPY --chown=user_lerobot:user_lerobot src/ src/
|
||||
|
||||
ARG UNBOUND_DEPS=false
|
||||
|
||||
@@ -59,7 +59,7 @@ ENV HOME=/home/user_lerobot \
|
||||
RUN uv venv
|
||||
|
||||
# Install Python dependencies for caching
|
||||
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml README.md MANIFEST.in ./
|
||||
COPY --chown=user_lerobot:user_lerobot pyproject.toml README.md MANIFEST.in ./
|
||||
COPY --chown=user_lerobot:user_lerobot src/ src/
|
||||
|
||||
ARG UNBOUND_DEPS=false
|
||||
|
||||
@@ -19,8 +19,6 @@
|
||||
title: Train RL in Simulation
|
||||
- local: multi_gpu_training
|
||||
title: Multi GPU training
|
||||
- local: peft_training
|
||||
title: Training with PEFT (e.g., LoRA)
|
||||
title: "Tutorials"
|
||||
- sections:
|
||||
- local: lerobot-dataset-v3
|
||||
@@ -37,21 +35,13 @@
|
||||
title: SmolVLA
|
||||
- local: pi0
|
||||
title: π₀ (Pi0)
|
||||
- local: pi0fast
|
||||
title: π₀-FAST (Pi0Fast)
|
||||
- local: pi05
|
||||
title: π₀.₅ (Pi05)
|
||||
- local: groot
|
||||
title: NVIDIA GR00T N1.5
|
||||
- local: xvla
|
||||
title: X-VLA
|
||||
- local: walloss
|
||||
title: WALL-OSS
|
||||
title: "Policies"
|
||||
- sections:
|
||||
- local: sarm
|
||||
title: SARM
|
||||
title: "Reward Models"
|
||||
- sections:
|
||||
- local: async
|
||||
title: Use Async Inference
|
||||
@@ -63,8 +53,6 @@
|
||||
title: Environments from the Hub
|
||||
- local: envhub_leisaac
|
||||
title: Control & Train Robots in Sim (LeIsaac)
|
||||
- local: envhub_isaaclab_arena
|
||||
title: NVIDIA IsaacLab Arena Environments
|
||||
- local: libero
|
||||
title: Using Libero
|
||||
- local: metaworld
|
||||
|
||||
@@ -169,7 +169,7 @@ python -m lerobot.async_inference.robot_client \
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
import threading
|
||||
from lerobot.robots.so_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower import SO100FollowerConfig
|
||||
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
|
||||
from lerobot.async_inference.configs import RobotClientConfig
|
||||
from lerobot.async_inference.robot_client import RobotClient
|
||||
|
||||
@@ -12,42 +12,23 @@ The EarthRover Mini Plus is a fully open source mobile robot that connects throu
|
||||
|
||||
### Setting Up the Frodobots SDK
|
||||
|
||||
The robot needs the [Frodobots SDK](https://github.com/frodobots-org/earth-rovers-sdk) running on your computer. Here's how:
|
||||
The robot needs the [Frodobots SDK](https://github.com/Frodobots/earth-rovers-sdk) running on your computer. Here's how:
|
||||
|
||||
1. Download and install the SDK:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/frodobots-org/earth-rovers-sdk.git
|
||||
git clone https://github.com/Frodobots/earth-rovers-sdk.git
|
||||
cd earth-rovers-sdk
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
2. Save Credentials:
|
||||
|
||||
Write your .env variables with the SDK API key and bot name provided by the Frodobots team.
|
||||
|
||||
```bash
|
||||
SDK_API_TOKEN=your_sdk_api_token_here
|
||||
BOT_SLUG=your_bot_slug_here
|
||||
CHROME_EXECUTABLE_PATH=/path/to/chrome_or_chromium
|
||||
# Default value is MAP_ZOOM_LEVEL=18 https://wiki.openstreetmap.org/wiki/Zoom_levels
|
||||
MAP_ZOOM_LEVEL=18
|
||||
MISSION_SLUG=your_mission_slug_here
|
||||
# Image quality between 0.1 and 1.0 (default: 0.8)
|
||||
# Recommended: 0.8 for better performance
|
||||
IMAGE_QUALITY=0.8
|
||||
# Image format: jpeg, png or webp (default: png)
|
||||
# Recommended: jpeg for better performance and lower bandwidth usage
|
||||
IMAGE_FORMAT=jpeg
|
||||
```
|
||||
|
||||
3. Start the SDK:
|
||||
2. Start the SDK:
|
||||
|
||||
```bash
|
||||
hypercorn main:app --reload
|
||||
```
|
||||
|
||||
4. Open your web browser and go to `http://localhost:8000`, then click "Join"
|
||||
3. Open your web browser and go to `http://localhost:8000`, then click "Join"
|
||||
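The `.env` comments in the credentials step give allowed ranges for `IMAGE_QUALITY` (0.1 to 1.0) and `IMAGE_FORMAT` (jpeg, png or webp). A minimal sketch of checking those values before starting the SDK follows. The `parse_env` and `check_image_settings` helpers are hypothetical and are not part of the Frodobots SDK. The fallback defaults are taken from the comments in the `.env` example.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_image_settings(values):
    """Validate image settings against the ranges documented in the .env comments."""
    quality = float(values.get("IMAGE_QUALITY", "0.8"))
    if not 0.1 <= quality <= 1.0:
        raise ValueError(f"IMAGE_QUALITY must be between 0.1 and 1.0, got {quality}")
    image_format = values.get("IMAGE_FORMAT", "png")
    if image_format not in {"jpeg", "png", "webp"}:
        raise ValueError(f"IMAGE_FORMAT must be jpeg, png or webp, got {image_format}")
    return quality, image_format


sample = """
# Image quality between 0.1 and 1.0 (default: 0.8)
IMAGE_QUALITY=0.8
IMAGE_FORMAT=jpeg
"""
settings = check_image_settings(parse_env(sample))
print(settings)  # (0.8, 'jpeg')
```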
|
||||
The SDK gives you:
|
||||
|
||||
|
||||
@@ -2,32 +2,14 @@
|
||||
|
||||
The **EnvHub** feature allows you to load simulation environments directly from the Hugging Face Hub with a single line of code. This unlocks a powerful new model for collaboration: instead of environments being locked away inside monolithic libraries, anyone can publish custom environments and share them with the community.
|
||||
|
||||
## What is EnvHub?
|
||||
## Overview
|
||||
|
||||
EnvHub lets you create custom robotics simulation environments with your own robot models and scenarios, and make them easily usable by anyone through the LeRobot framework.
|
||||
With EnvHub, you can:
|
||||
|
||||
EnvHub packages are stored on the Hugging Face Hub, and can be seamlessly pulled and used in your AI robotics projects through LeRobot with a single line of code.
|
||||
|
||||
Thanks to EnvHub, you can:
|
||||
|
||||
1. **Create and publish environments** to the Hugging Face Hub as Git repositories, and distribute complex physics simulations without packaging hassles
|
||||
2. **Load environments** dynamically, without installing them as packages
|
||||
3. **Version and track** environment changes using Git semantics
|
||||
4. **Discover** new simulation tasks shared by the community
|
||||
|
||||
This design means you can go from discovering an interesting environment on the Hub to running experiments in seconds, or create your own custom robot and environment without worrying about dependency conflicts or complex installation procedures.
|
||||
|
||||
When you create an EnvHub package, you can build anything you want inside it and use any simulation tool you like: this is your own space to play with. The only requirement is that the package contains an `env.py` file that defines the environment and allows LeRobot to load and use your EnvHub package.
|
||||
|
||||
This `env.py` file needs to expose a small API so LeRobot can load and run it. In particular, you must provide a `make_env(n_envs: int = 1, use_async_envs: bool = False)` or `make_env(n_envs: int = 1, use_async_envs: bool = False, cfg: EnvConfig)` function, which is the main entry point for LeRobot. It should return one of:
|
||||
|
||||
- A `gym.vector.VectorEnv` (most common)
|
||||
- A single `gym.Env` (will be automatically wrapped)
|
||||
- A dict mapping `{suite_name: {task_id: VectorEnv}}` (for multi-task benchmarks)
|
||||
|
||||
You can also pass an `EnvConfig` object to `make_env` to configure the environment (e.g. the number of environments, task, camera name, initial states, control mode, episode length, etc.).
|
||||
|
||||
Finally, your environment must implement the standard `gym.vector.VectorEnv` interface so it works with LeRobot, including methods like `reset` and `step`.
|
||||
- Load environments from the Hub instantly
|
||||
- Share your custom simulation tasks with the community
|
||||
- Version control your environments using Git
|
||||
- Distribute complex physics simulations without packaging hassles
|
||||
|
||||
## Quick Start
|
||||
|
||||
@@ -47,6 +29,17 @@ env = make_env("lerobot/cartpole-env", trust_remote_code=True)
|
||||
hash for reproducibility and security.
|
||||
</Tip>
|
||||
|
||||
## What is EnvHub?
|
||||
|
||||
EnvHub is a framework that allows researchers and developers to:
|
||||
|
||||
1. **Publish environments** to the Hugging Face Hub as Git repositories
|
||||
2. **Load environments** dynamically without installing them as packages
|
||||
3. **Version and track** environment changes using Git semantics
|
||||
4. **Discover** new simulation tasks shared by the community
|
||||
|
||||
This design means you can go from discovering an interesting environment on the Hub to running experiments in seconds, without worrying about dependency conflicts or complex installation procedures.
|
||||
|
||||
## Repository Structure
|
||||
|
||||
To make your environment loadable from the Hub, your repository must contain at minimum:
|
||||
|
||||
@@ -1,510 +0,0 @@
|
||||
# NVIDIA IsaacLab Arena & LeRobot
|
||||
|
||||
LeRobot EnvHub now supports **GPU-accelerated simulation** with IsaacLab Arena for policy evaluation at scale.
|
||||
Train and evaluate imitation learning policies with high-fidelity simulation — all integrated into the LeRobot ecosystem.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/nvidia/isaaclab-arena-envs/resolve/main/assets/Gr1OpenMicrowaveEnvironment.png"
|
||||
alt="IsaacLab Arena - GR1 Microwave Environment"
|
||||
style={{ maxWidth: "100%", borderRadius: "8px", marginBottom: "1rem" }}
|
||||
/>
|
||||
|
||||
[IsaacLab Arena](https://github.com/isaac-sim/IsaacLab-Arena) integrates with NVIDIA IsaacLab to provide:
|
||||
|
||||
- 🤖 **Humanoid embodiments**: GR1, G1, Galileo with various configurations
|
||||
- 🎯 **Manipulation & loco-manipulation tasks**: Door opening, pick-and-place, button pressing, and more
|
||||
- ⚡ **GPU-accelerated rollouts**: Parallel environment execution on NVIDIA GPUs
|
||||
- 🖼️ **RTX Rendering**: Evaluate vision-based policies with realistic rendering, reflections and refractions
|
||||
- 📦 **LeRobot-compatible datasets**: Ready for training with GR00T N1x, PI0, SmolVLA, ACT, and Diffusion policies
|
||||
- 🔄 **EnvHub integration**: Load environments from HuggingFace EnvHub with one line
|
||||
|
||||
## Installation
|
||||
|
||||
### Prerequisites
|
||||
|
||||
Hardware requirements are shared with Isaac Sim, and are detailed in [Isaac Sim Requirements](https://docs.isaacsim.omniverse.nvidia.com/5.1.0/installation/requirements.html).
|
||||
|
||||
- NVIDIA GPU with CUDA support
|
||||
- NVIDIA driver compatible with IsaacSim 5.1.0
|
||||
- Linux (Ubuntu 22.04 / 24.04)
|
||||
|
||||
### Setup
|
||||
|
||||
```bash
|
||||
# 1. Create conda environment
|
||||
conda create -y -n lerobot-arena python=3.11
|
||||
conda activate lerobot-arena
|
||||
conda install -y -c conda-forge ffmpeg=7.1.1
|
||||
|
||||
# 2. Install Isaac Sim 5.1.0
|
||||
pip install "isaacsim[all,extscache]==5.1.0" --extra-index-url https://pypi.nvidia.com
|
||||
|
||||
# Accept NVIDIA EULA (required)
|
||||
export ACCEPT_EULA=Y
|
||||
export PRIVACY_CONSENT=Y
|
||||
|
||||
# 3. Install IsaacLab 2.3.0
|
||||
git clone https://github.com/isaac-sim/IsaacLab.git
|
||||
cd IsaacLab
|
||||
git checkout v2.3.0
|
||||
./isaaclab.sh -i
|
||||
cd ..
|
||||
|
||||
# 4. Install IsaacLab Arena
|
||||
git clone https://github.com/isaac-sim/IsaacLab-Arena.git
|
||||
cd IsaacLab-Arena
|
||||
git checkout release/0.1.1
|
||||
pip install -e .
|
||||
cd ..
|
||||
|
||||
|
||||
# 5. Install LeRobot
|
||||
git clone https://github.com/huggingface/lerobot.git
|
||||
cd lerobot
|
||||
pip install -e .
|
||||
cd ..
|
||||
|
||||
|
||||
# 6. Install additional dependencies
|
||||
pip install onnxruntime==1.23.2 lightwheel-sdk==1.0.1 vuer[all]==0.0.70 qpsolvers==4.8.1
|
||||
pip install numpy==1.26.0 # Isaac Sim 5.1 depends on numpy==1.26.0, this will be fixed in next release
|
||||
```
|
||||
|
||||
## Evaluating Policies
|
||||
|
||||
### Pre-trained Policies
|
||||
|
||||
The following trained policies are available:
|
||||
|
||||
| Policy | Architecture | Task | Link |
|
||||
| :-------------------------- | :----------- | :------------ | :----------------------------------------------------------------------- |
|
||||
| pi05-arena-gr1-microwave | PI0.5 | GR1 Microwave | [HuggingFace](https://huggingface.co/nvidia/pi05-arena-gr1-microwave) |
|
||||
| smolvla-arena-gr1-microwave | SmolVLA | GR1 Microwave | [HuggingFace](https://huggingface.co/nvidia/smolvla-arena-gr1-microwave) |
|
||||
|
||||
### Evaluate SmolVLA
|
||||
|
||||
```bash
|
||||
pip install -e ".[smolvla]"
|
||||
pip install numpy==1.26.0 # revert numpy to version 1.26
|
||||
```
|
||||
|
||||
```bash
|
||||
lerobot-eval \
|
||||
--policy.path=nvidia/smolvla-arena-gr1-microwave \
|
||||
--env.type=isaaclab_arena \
|
||||
--env.hub_path=nvidia/isaaclab-arena-envs \
|
||||
--rename_map='{"observation.images.robot_pov_cam_rgb": "observation.images.robot_pov_cam"}' \
|
||||
--policy.device=cuda \
|
||||
--env.environment=gr1_microwave \
|
||||
--env.embodiment=gr1_pink \
|
||||
--env.object=mustard_bottle \
|
||||
--env.headless=false \
|
||||
--env.enable_cameras=true \
|
||||
--env.video=true \
|
||||
--env.video_length=10 \
|
||||
--env.video_interval=15 \
|
||||
--env.state_keys=robot_joint_pos \
|
||||
--env.camera_keys=robot_pov_cam_rgb \
|
||||
--trust_remote_code=True \
|
||||
--eval.batch_size=1
|
||||
```
|
||||
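The `--rename_map` flag in the command above takes a JSON object of observation keys. Judging by the key names, it appears to map the environment's camera key to the name the policy expects, though that direction is an assumption here. The sketch below illustrates the idea on a plain dictionary. `rename_keys` is a hypothetical helper, not LeRobot's implementation.

```python
import json

# The JSON string passed to --rename_map in the command above.
rename_map = json.loads(
    '{"observation.images.robot_pov_cam_rgb": "observation.images.robot_pov_cam"}'
)


def rename_keys(observation, rename_map):
    """Return a copy of `observation` with keys renamed; unmapped keys are kept."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


# Placeholder strings stand in for the actual tensors.
observation = {
    "observation.images.robot_pov_cam_rgb": "<image tensor>",
    "observation.state": "<state tensor>",
}
renamed = rename_keys(observation, rename_map)
print(sorted(renamed))  # ['observation.images.robot_pov_cam', 'observation.state']
```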
|
||||
### Evaluate PI0.5
|
||||
|
||||
```bash
|
||||
pip install -e ".[pi]"
|
||||
pip install numpy==1.26.0 # revert numpy to version 1.26
|
||||
```
|
||||
|
||||
<Tip>PI0.5 requires disabling torch compile for evaluation:</Tip>
|
||||
|
||||
```bash
|
||||
TORCH_COMPILE_DISABLE=1 TORCHINDUCTOR_DISABLE=1 lerobot-eval \
|
||||
--policy.path=nvidia/pi05-arena-gr1-microwave \
|
||||
--env.type=isaaclab_arena \
|
||||
--env.hub_path=nvidia/isaaclab-arena-envs \
|
||||
--rename_map='{"observation.images.robot_pov_cam_rgb": "observation.images.robot_pov_cam"}' \
|
||||
--policy.device=cuda \
|
||||
--env.environment=gr1_microwave \
|
||||
--env.embodiment=gr1_pink \
|
||||
--env.object=mustard_bottle \
|
||||
--env.headless=false \
|
||||
--env.enable_cameras=true \
|
||||
--env.video=true \
|
||||
--env.video_length=15 \
|
||||
--env.video_interval=15 \
|
||||
--env.state_keys=robot_joint_pos \
|
||||
--env.camera_keys=robot_pov_cam_rgb \
|
||||
--trust_remote_code=True \
|
||||
--eval.batch_size=1
|
||||
```
|
||||
|
||||
<Tip>
|
||||
To change the number of parallel environments, use the `--eval.batch_size`
|
||||
flag.
|
||||
</Tip>
|
||||
|
||||
### What to Expect
|
||||
|
||||
During evaluation, you will see a progress bar showing the running success rate:
|
||||
|
||||
```
|
||||
Stepping through eval batches: 8%|██████▍ | 4/50 [00:45<08:06, 10.58s/it, running_success_rate=25.0%]
|
||||
```
|
||||
|
||||
### Video Recording
|
||||
|
||||
To enable video recording during evaluation, add the following flags to your command:
|
||||
|
||||
```bash
|
||||
--env.video=true \
|
||||
--env.video_length=15 \
|
||||
--env.video_interval=15
|
||||
```
|
||||
|
||||
For more details on video recording, see the [IsaacLab Recording Documentation](https://isaac-sim.github.io/IsaacLab/main/source/how-to/record_video.html).
|
||||
|
||||
<Tip>
|
||||
When running headless with `--env.headless=true`, you must also enable cameras explicitly for camera-enabled environments:
|
||||
|
||||
```bash
|
||||
--env.headless=true --env.enable_cameras=true
|
||||
```
|
||||
|
||||
</Tip>
|
||||
|
||||
### Output Directory
|
||||
|
||||
Evaluation videos are saved to the output directory with the following structure:
|
||||
|
||||
```
|
||||
outputs/eval/<date>/<timestamp>_<env>_<policy>/videos/<task>_<env_id>/eval_episode_<n>.mp4
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```
|
||||
outputs/eval/2026-01-02/14-38-01_isaaclab_arena_smolvla/videos/gr1_microwave_0/eval_episode_0.mp4
|
||||
```
|
||||
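If you need to locate these videos from a script, the directory pattern above can be assembled with `pathlib`. This is a minimal sketch: the `eval_video_path` helper and its parameter names are hypothetical and simply mirror the placeholders in the pattern.

```python
from pathlib import PurePosixPath


def eval_video_path(date, timestamp, env, policy, task, env_id, episode):
    """Build a video path following the documented output layout."""
    return (
        PurePosixPath("outputs/eval")
        / date
        / f"{timestamp}_{env}_{policy}"
        / "videos"
        / f"{task}_{env_id}"
        / f"eval_episode_{episode}.mp4"
    )


path = eval_video_path(
    "2026-01-02", "14-38-01", "isaaclab_arena", "smolvla", "gr1_microwave", 0, 0
)
print(path)
```

With these arguments the result matches the example path shown above.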
|
||||
## Training Policies
|
||||
|
||||
To learn more about training policies with LeRobot, please refer to the training documentation:
|
||||
|
||||
- [SmolVLA](./smolvla)
|
||||
- [Pi0.5](./pi05)
|
||||
- [GR00T N1.5](./groot)
|
||||
|
||||
Sample IsaacLab Arena datasets are available on HuggingFace Hub for experimentation:
|
||||
|
||||
| Dataset | Description | Frames |
|
||||
| :-------------------------------------------------------------------------------------------------------- | :------------------------- | :----- |
|
||||
| [Arena-GR1-Manipulation-Task](https://huggingface.co/datasets/nvidia/Arena-GR1-Manipulation-Task-v3) | GR1 microwave manipulation | ~4K |
|
||||
| [Arena-G1-Loco-Manipulation-Task](https://huggingface.co/datasets/nvidia/Arena-G1-Loco-Manipulation-Task) | G1 loco-manipulation | ~4K |
|
||||
|
||||
## Environment Configuration
|
||||
|
||||
### Full Configuration Options
|
||||
|
||||
```python
|
||||
from lerobot.envs.configs import IsaaclabArenaEnv
|
||||
|
||||
config = IsaaclabArenaEnv(
|
||||
# Environment selection
|
||||
environment="gr1_microwave", # Task environment
|
||||
embodiment="gr1_pink", # Robot embodiment
|
||||
object="power_drill", # Object to manipulate
|
||||
|
||||
# Simulation settings
|
||||
episode_length=300, # Max steps per episode
|
||||
headless=True, # Run without GUI
|
||||
device="cuda:0", # GPU device
|
||||
seed=42, # Random seed
|
||||
|
||||
# Observation configuration
|
||||
state_keys="robot_joint_pos", # State observation keys (comma-separated)
|
||||
camera_keys="robot_pov_cam_rgb", # Camera observation keys (comma-separated)
|
||||
state_dim=54, # Expected state dimension
|
||||
action_dim=36, # Expected action dimension
|
||||
camera_height=512, # Camera image height
|
||||
camera_width=512, # Camera image width
|
||||
enable_cameras=True, # Enable camera observations
|
||||
|
||||
# Video recording
|
||||
video=False, # Enable video recording
|
||||
video_length=100, # Frames per video
|
||||
video_interval=200, # Steps between recordings
|
||||
|
||||
# Advanced
|
||||
mimic=False, # Enable mimic mode
|
||||
teleop_device=None, # Teleoperation device
|
||||
disable_fabric=False, # Disable fabric optimization
|
||||
enable_pinocchio=True, # Enable Pinocchio for IK
|
||||
)
|
||||
```
|
||||
|
||||
### Using Environment Hub directly for advanced usage
|
||||
|
||||
Create a file called `test_env_load_arena.py` or [download from the EnvHub](https://huggingface.co/nvidia/isaaclab-arena-envs/blob/main/tests/test_env_load_arena.py):
|
||||
|
||||
```python
|
||||
import logging
|
||||
from dataclasses import asdict
|
||||
from pprint import pformat
|
||||
import torch
|
||||
import tqdm
|
||||
from lerobot.configs import parser
|
||||
from lerobot.configs.eval import EvalPipelineConfig
|
||||
|
||||
|
||||
@parser.wrap()
|
||||
def main(cfg: EvalPipelineConfig):
|
||||
"""Run random action rollout for IsaacLab Arena environment."""
|
||||
logging.info(pformat(asdict(cfg)))
|
||||
|
||||
from lerobot.envs.factory import make_env
|
||||
|
||||
env_dict = make_env(
|
||||
cfg.env,
|
||||
n_envs=cfg.env.num_envs,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
env = next(iter(env_dict.values()))[0]
|
||||
env.reset()
|
||||
for _ in tqdm.tqdm(range(cfg.env.episode_length)):
|
||||
with torch.inference_mode():
|
||||
actions = env.action_space.sample()
|
||||
obs, rewards, terminated, truncated, info = env.step(actions)
|
||||
if terminated.any() or truncated.any():
|
||||
obs, info = env.reset()
|
||||
env.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
```
|
||||
|
||||
Run with:
|
||||
|
||||
```bash
|
||||
python test_env_load_arena.py \
|
||||
--env.environment=g1_locomanip_pnp \
|
||||
--env.embodiment=gr1_pink \
|
||||
--env.object=cracker_box \
|
||||
--env.num_envs=4 \
|
||||
--env.enable_cameras=true \
|
||||
--env.seed=1000 \
|
||||
--env.video=true \
|
||||
--env.video_length=10 \
|
||||
--env.video_interval=15 \
|
||||
--env.headless=false \
|
||||
--env.hub_path=nvidia/isaaclab-arena-envs \
|
||||
--env.type=isaaclab_arena
|
||||
```
|
||||
|
||||
## Creating New Environments
|
||||
|
||||
First create a new IsaacLab Arena environment by following the [IsaacLab Arena Documentation](https://isaac-sim.github.io/IsaacLab-Arena/release/0.1.1/index.html).
|
||||
|
||||
Clone our EnvHub repo:
|
||||
|
||||
```bash
|
||||
git clone https://huggingface.co/nvidia/isaaclab-arena-envs
|
||||
```
|
||||
|
||||
Modify the `example_envs.yaml` file based on your new environment.
|
||||
[Upload](./envhub#step-3-upload-to-the-hub) your modified repo to HuggingFace EnvHub.
|
||||
|
||||
<Tip>
|
||||
Your IsaacLab Arena environment code must be locally available during
|
||||
evaluation. Users can clone your environment repository separately, or you can
|
||||
bundle the environment code and assets directly in your EnvHub repo.
|
||||
</Tip>
|
||||
|
||||
Then, when evaluating, use your new environment:
|
||||
|
||||
```bash
|
||||
lerobot-eval \
|
||||
--env.hub_path=<your-env-hub-path>/isaaclab-arena-envs \
|
||||
--env.environment=<your new environment> \
|
||||
...other flags...
|
||||
```
|
||||
|
||||
We look forward to your contributions!
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### CUDA out of memory
|
||||
|
||||
Reduce `batch_size` or use a GPU with more VRAM:
|
||||
|
||||
```bash
|
||||
--eval.batch_size=1
|
||||
```
|
||||
|
||||
### EULA not accepted
|
||||
|
||||
Set environment variables before running:
|
||||
|
||||
```bash
|
||||
export ACCEPT_EULA=Y
|
||||
export PRIVACY_CONSENT=Y
|
||||
```
|
||||
|
||||
### Video recording not working
|
||||
|
||||
Enable cameras when running headless:
|
||||
|
||||
```bash
|
||||
--env.video=true --env.enable_cameras=true --env.headless=true
|
||||
```
|
||||
|
||||
### Policy output dimension mismatch
|
||||
|
||||
Ensure `action_dim` matches your policy:
|
||||
|
||||
```bash
|
||||
--env.action_dim=36
|
||||
```
|
||||
|
||||
### libGLU.so.1 Errors during Isaac Sim initialization
|
||||
|
||||
This error is most likely on headless machines. Ensure the following dependencies are installed:
|
||||
|
||||
```bash
|
||||
sudo apt update && sudo apt install -y libglu1-mesa libxt6
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [EnvHub Documentation](./envhub.mdx) - General EnvHub usage
|
||||
- [IsaacLab Arena GitHub](https://github.com/isaac-sim/IsaacLab-Arena)
|
||||
- [IsaacLab Documentation](https://isaac-sim.github.io/IsaacLab/)
|
||||
|
||||
## Lightwheel LW-BenchHub
|
||||
|
||||
[Lightwheel](https://www.lightwheel.ai) is bringing `Lightwheel-Libero-Tasks` and `Lightwheel-RoboCasa-Tasks` with 268 tasks to the LeRobot ecosystem.
|
||||
LW-BenchHub collects and generates large-scale datasets via teleoperation that comply with the LeRobot specification, enabling out-of-the-box training and evaluation workflows.
|
||||
With the unified interface provided by EnvHub, developers can quickly build end-to-end experimental pipelines.
|
||||
|
||||
### Install
|
||||
|
||||
Assuming you followed the [Installation](#installation) steps, you can install LW-BenchHub with:
|
||||
|
||||
```bash
|
||||
conda install pinocchio -c conda-forge -y
|
||||
pip install numpy==1.26.0 # revert numpy to version 1.26
|
||||
|
||||
sudo apt-get install git-lfs && git lfs install
|
||||
|
||||
git clone https://github.com/LightwheelAI/lw_benchhub
|
||||
cd lw_benchhub
|
||||
git lfs pull # Run inside the repo so LFS files (e.g., .usd assets) are downloaded
|
||||
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
For more detailed instructions, please refer to the [LW-BenchHub Documentation](https://docs.lightwheel.net/lw_benchhub/usage/Installation).
|
||||
|
||||
### Lightwheel Tasks Dataset
|
||||
|
||||
LW-BenchHub datasets are available on HuggingFace Hub:
|
||||
|
||||
| Dataset | Description | Tasks | Frames |
|
||||
| :------------------------------------------------------------------------------------------------------------ | :---------------------- | :---- | :----- |
|
||||
| [Lightwheel-Tasks-X7S](https://huggingface.co/datasets/LightwheelAI/Lightwheel-Tasks-X7S) | X7S LIBERO and RoboCasa | 117 | ~10.3M |
|
||||
| [Lightwheel-Tasks-Double-Piper](https://huggingface.co/datasets/LightwheelAI/Lightwheel-Tasks-Double-Piper) | Double-Piper LIBERO | 130 | ~6.0M |
|
||||
| [Lightwheel-Tasks-G1-Controller](https://huggingface.co/datasets/LightwheelAI/Lightwheel-Tasks-G1-Controller) | G1-Controller LIBERO | 62 | ~2.7M |
|
||||
| [Lightwheel-Tasks-G1-WBC](https://huggingface.co/datasets/LightwheelAI/Lightwheel-Tasks-G1-WBC) | G1-WBC RoboCasa | 32 | ~1.5M |
|
||||
|
||||
For training policies, refer to the [Training Policies](#training-policies) section.
|
||||
|
||||
### Evaluating Policies
|
||||
|
||||
#### Pre-trained Policies
|
||||
|
||||
The following trained policies are available:
|
||||
|
||||
| Policy | Architecture | Task | Layout | Robot | Link |
|
||||
| :----------------------- | :----------- | :----------------------------- | :--------- | :-------------- | :------------------------------------------------------------------------------------ |
|
||||
| smolvla-double-piper-pnp | SmolVLA | L90K1PutTheBlackBowlOnThePlate | libero-1-1 | DoublePiper-Abs | [HuggingFace](https://huggingface.co/LightwheelAI/smolvla-double-piper-pnp/tree/main) |
|
||||
|
||||
#### Evaluate SmolVLA
|
||||
|
||||
```bash
|
||||
lerobot-eval \
|
||||
--policy.path=LightwheelAI/smolvla-double-piper-pnp \
|
||||
--env.type=isaaclab_arena \
|
||||
--rename_map='{"observation.images.left_hand_camera_rgb": "observation.images.left_hand", "observation.images.right_hand_camera_rgb": "observation.images.right_hand", "observation.images.first_person_camera_rgb": "observation.images.first_person"}' \
|
||||
--env.hub_path=LightwheelAI/lw_benchhub_env \
|
||||
--env.kwargs='{"config_path": "configs/envhub/example.yml"}' \
|
||||
--trust_remote_code=true \
|
||||
--env.state_keys=joint_pos \
|
||||
--env.action_dim=12 \
|
||||
--env.camera_keys=left_hand_camera_rgb,right_hand_camera_rgb,first_person_camera_rgb \
|
||||
--policy.device=cuda \
|
||||
--eval.batch_size=10 \
|
||||
--eval.n_episodes=100
|
||||
```
|
||||
|
||||
### Environment Configuration
|
||||
|
||||
Evaluation can be quickly launched by modifying the `robot`, `task`, and `layout` settings in the configuration file.
|
||||
|
||||
#### Full Configuration Options
|
||||
|
||||
```yml
|
||||
# =========================
|
||||
# Basic Settings
|
||||
# =========================
|
||||
disable_fabric: false
|
||||
device: cuda:0
|
||||
sensitivity: 1.0
|
||||
step_hz: 50
|
||||
enable_cameras: true
|
||||
execute_mode: eval
|
||||
episode_length_s: 20.0 # Episode length in seconds, increase if episodes timeout during eval
|
||||
|
||||
# =========================
|
||||
# Robot Settings
|
||||
# =========================
|
||||
robot: DoublePiper-Abs # Robot type: DoublePiper-Abs, X7S-Abs, G1-Controller, or G1-Controller-DecoupledWBC
|
||||
robot_scale: 1.0
|
||||
|
||||
# =========================
|
||||
# Task & Scene Settings
|
||||
# =========================
|
||||
task: L90K1PutTheBlackBowlOnThePlate # Task name
|
||||
scene_backend: robocasa
|
||||
task_backend: robocasa
|
||||
debug_assets: null
|
||||
layout: libero-1-1 # Layout and style ID
|
||||
sources:
|
||||
- objaverse
|
||||
- lightwheel
|
||||
- aigen_objs
|
||||
object_projects: []
|
||||
usd_simplify: false
|
||||
seed: 42
|
||||
|
||||
# =========================
|
||||
# Object Placement Retry Settings
|
||||
# =========================
|
||||
max_scene_retry: 4
|
||||
max_object_placement_retry: 3
|
||||
|
||||
resample_objects_placement_on_reset: true
|
||||
resample_robot_placement_on_reset: true
|
||||
|
||||
# =========================
|
||||
# Replay Configuration Settings
|
||||
# =========================
|
||||
replay_cfgs:
|
||||
add_camera_to_observation: true
|
||||
render_resolution: [640, 480]
|
||||
```
|
||||
|
||||
### See Also
|
||||
|
||||
- [LW-BenchHub GitHub](https://github.com/LightwheelAI/LW-BenchHub)
|
||||
- [LW-BenchHub Documentation](https://docs.lightwheel.net/lw_benchhub/)
|
||||
@@ -137,8 +137,7 @@ from lerobot.teleoperators import ( # noqa: F401
|
||||
Teleoperator,
|
||||
TeleoperatorConfig,
|
||||
make_teleoperator_from_config,
|
||||
so_leader,
|
||||
bi_so_leader,
|
||||
so101_leader,
|
||||
)
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.utils import init_logging
|
||||
@@ -197,7 +196,7 @@ def teleop_loop(teleop: Teleoperator, env: gym.Env, fps: int):
|
||||
obs, info = env.reset()
|
||||
|
||||
dt_s = time.perf_counter() - loop_start
|
||||
precise_sleep(max(1 / fps - dt_s, 0.0))
|
||||
precise_sleep(1 / fps - dt_s)
|
||||
loop_s = time.perf_counter() - loop_start
|
||||
print(f"\ntime: {loop_s * 1e3:.2f}ms ({1 / loop_s:.0f} Hz)")
|
||||
|
||||
@@ -223,7 +222,7 @@ def teleoperate(cfg: TeleoperateConfig):
|
||||
|
||||
def main():
|
||||
teleoperate(TeleoperateConfig(
|
||||
teleop=so_leader.SO101LeaderConfig(
|
||||
teleop=so101_leader.SO101LeaderConfig(
|
||||
port="/dev/ttyACM0",
|
||||
id='leader',
|
||||
use_degrees=False,
|
||||
|
||||
@@ -12,12 +12,6 @@ Developers and researchers can post-train GR00T N1.5 with their own real or synt
|
||||
|
||||
GR00T N1.5 (specifically the GR00T-N1.5-3B model) is built using pre-trained vision and language encoders. It utilizes a flow matching action transformer to model a chunk of actions, conditioned on vision, language, and proprioception.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-groot-paper1%20(1).png"
|
||||
alt="An overview of GR00T"
|
||||
width="80%"
|
||||
/>
|
||||
|
||||
Its strong performance comes from being trained on an expansive and diverse humanoid dataset, which includes:
|
||||
|
||||
- Real captured data from robots.
|
||||
@@ -109,7 +103,7 @@ Once you have trained your model using your parameters you can run inference in
|
||||
|
||||
```bash
|
||||
lerobot-record \
|
||||
--robot.type=bi_so_follower \
|
||||
--robot.type=bi_so100_follower \
|
||||
--robot.left_arm_port=/dev/ttyACM1 \
|
||||
--robot.right_arm_port=/dev/ttyACM0 \
|
||||
--robot.id=bimanual_follower \
|
||||
|
||||
@@ -58,8 +58,8 @@ lerobot-teleoperate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO101LeaderConfig, SO101Leader
|
||||
from lerobot.robots.so_follower import SO101FollowerConfig, SO101Follower
|
||||
from lerobot.teleoperators.so101_leader import SO101LeaderConfig, SO101Leader
|
||||
from lerobot.robots.so101_follower import SO101FollowerConfig, SO101Follower
|
||||
|
||||
robot_config = SO101FollowerConfig(
|
||||
port="/dev/tty.usbmodem58760431541",
|
||||
@@ -195,14 +195,13 @@ lerobot-record \
|
||||
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
from lerobot.datasets.utils import hw_to_dataset_features
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.teleoperators.so_leader.config_so100_leader import SO100LeaderConfig
|
||||
from lerobot.teleoperators.so_leader.so100_leader import SO100Leader
|
||||
from lerobot.robots.so100_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.teleoperators.so100_leader.config_so100_leader import SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader.so100_leader import SO100Leader
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
from lerobot.utils.visualization_utils import init_rerun
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.processor import make_default_processors
|
||||
from lerobot.record import record_loop
|
||||
|
||||
NUM_EPISODES = 5
|
||||
FPS = 30
|
||||
@@ -210,19 +209,12 @@ EPISODE_TIME_SEC = 60
|
||||
RESET_TIME_SEC = 10
|
||||
TASK_DESCRIPTION = "My task description"
|
||||
|
||||
# Create robot configuration
|
||||
# Create the robot and teleoperator configurations
|
||||
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
|
||||
robot_config = SO100FollowerConfig(
|
||||
id="my_awesome_follower_arm",
|
||||
cameras={
|
||||
"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS) # Optional: fourcc="MJPG" for troubleshooting OpenCV async error.
|
||||
},
|
||||
port="/dev/tty.usbmodem58760434471",
|
||||
)
|
||||
|
||||
teleop_config = SO100LeaderConfig(
|
||||
id="my_awesome_leader_arm",
|
||||
port="/dev/tty.usbmodem585A0077581",
|
||||
port="/dev/tty.usbmodem58760434471", id="my_awesome_follower_arm", cameras=camera_config
|
||||
)
|
||||
teleop_config = SO100LeaderConfig(port="/dev/tty.usbmodem585A0077581", id="my_awesome_leader_arm")
|
||||
|
||||
# Initialize the robot and teleoperator
|
||||
robot = SO100Follower(robot_config)
|
||||
@@ -251,9 +243,6 @@ init_rerun(session_name="recording")
|
||||
robot.connect()
|
||||
teleop.connect()
|
||||
|
||||
# Create the required processors
|
||||
teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
|
||||
|
||||
episode_idx = 0
|
||||
while episode_idx < NUM_EPISODES and not events["stop_recording"]:
|
||||
log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
|
||||
@@ -262,9 +251,6 @@ while episode_idx < NUM_EPISODES and not events["stop_recording"]:
|
||||
robot=robot,
|
||||
events=events,
|
||||
fps=FPS,
|
||||
teleop_action_processor=teleop_action_processor,
|
||||
robot_action_processor=robot_action_processor,
|
||||
robot_observation_processor=robot_observation_processor,
|
||||
teleop=teleop,
|
||||
dataset=dataset,
|
||||
control_time_s=EPISODE_TIME_SEC,
|
||||
@@ -279,9 +265,6 @@ while episode_idx < NUM_EPISODES and not events["stop_recording"]:
|
||||
robot=robot,
|
||||
events=events,
|
||||
fps=FPS,
|
||||
teleop_action_processor=teleop_action_processor,
|
||||
robot_action_processor=robot_action_processor,
|
||||
robot_observation_processor=robot_observation_processor,
|
||||
teleop=teleop,
|
||||
control_time_s=RESET_TIME_SEC,
|
||||
single_task=TASK_DESCRIPTION,
|
||||
@@ -408,8 +391,8 @@ lerobot-replay \
|
||||
import time
|
||||
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
from lerobot.robots.so_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.so100_follower import SO100Follower
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.utils import log_say
|
||||
|
||||
@@ -432,7 +415,7 @@ for idx in range(dataset.num_frames):
|
||||
}
|
||||
robot.send_action(action)
|
||||
|
||||
precise_sleep(max(1.0 / dataset.fps - (time.perf_counter() - t0), 0.0))
|
||||
precise_sleep(1.0 / dataset.fps - (time.perf_counter() - t0))
|
||||
|
||||
robot.disconnect()
|
||||
```
|
||||
@@ -531,8 +514,8 @@ from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
from lerobot.datasets.utils import hw_to_dataset_features
|
||||
from lerobot.policies.act.modeling_act import ACTPolicy
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.robots.so_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.so100_follower import SO100Follower
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
|
||||
@@ -18,7 +18,7 @@ If you're using Feetech or Dynamixel motors, LeRobot provides built-in bus inter
|
||||
- [`DynamixelMotorsBus`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/motors/dynamixel/dynamixel.py) – for controlling Dynamixel servos
|
||||
|
||||
Please refer to the [`MotorsBus`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/motors/motors_bus.py) abstract class to learn about its API.
|
||||
For a good example of how it can be used, you can have a look at our own [SO101 follower implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/robots/so_follower/so101_follower/so101_follower.py)
|
||||
For a good example of how it can be used, you can have a look at our own [SO101 follower implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/robots/so101_follower/so101_follower.py)
|
||||
|
||||
Use these if compatible. Otherwise, you'll need to find or write a Python interface (not covered in this tutorial):
|
||||
|
||||
|
||||
@@ -204,7 +204,7 @@ lerobot-calibrate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO100LeaderConfig, SO100Leader
|
||||
from lerobot.teleoperators.so100_leader import SO100LeaderConfig, SO100Leader
|
||||
|
||||
config = SO100LeaderConfig(
|
||||
port="/dev/tty.usbmodem58760431551",
|
||||
|
||||
@@ -1,62 +0,0 @@
|
||||
# Parameter efficient fine-tuning with 🤗 PEFT
|
||||
|
||||
[🤗 PEFT](https://github.com/huggingface/peft) (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting
|
||||
large pretrained models such as pre-trained policies (e.g., SmolVLA, π₀, ...) to new tasks without training all
|
||||
of the model's parameters while yielding comparable performance.
|
||||
|
||||
Install the `lerobot[peft]` optional package to enable PEFT support.
|
||||
|
||||
To read about all the possible adaptation methods, please refer to the [🤗 PEFT docs](https://huggingface.co/docs/peft/index).
|
||||
|
||||
## Training SmolVLA
|
||||
|
||||
In this section we'll show you how to train a pre-trained SmolVLA policy with PEFT on the libero dataset.
|
||||
For brevity we're only training on the `libero_spatial` subset. We will use `lerobot/smolvla_base` as the model
|
||||
to fine-tune parameter-efficiently:
|
||||
|
||||
```
|
||||
lerobot-train \
|
||||
--policy.path=lerobot/smolvla_base \
|
||||
--policy.repo_id=your_hub_name/my_libero_smolvla \
|
||||
--dataset.repo_id=HuggingFaceVLA/libero \
|
||||
--policy.output_features=null \
|
||||
--policy.input_features=null \
|
||||
--policy.optimizer_lr=1e-3 \
|
||||
--policy.scheduler_decay_lr=1e-4 \
|
||||
--env.type=libero \
|
||||
--env.task=libero_spatial \
|
||||
--steps=100000 \
|
||||
--batch_size=32 \
|
||||
--peft.method_type=LORA \
|
||||
--peft.r=64
|
||||
```
|
||||
|
||||
Note the `--peft.method_type` parameter, which lets you select which PEFT method to use. Here we use
|
||||
[LoRA](https://huggingface.co/docs/peft/main/en/package_reference/lora) (Low-Rank Adaptation), which is probably the most
|
||||
popular fine-tuning method to date. Low-rank adaptation means that we only fine-tune a matrix of comparatively low rank
|
||||
instead of the full weight matrix. This rank can be specified using the `--peft.r` parameter. The higher the rank
|
||||
the closer you get to full fine-tuning.
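
To make the effect of the rank concrete, the sketch below counts trainable parameters for a single linear layer under full fine-tuning versus a rank-`r` LoRA update. The layer size is an illustrative assumption, not SmolVLA's actual dimensions, and both helper functions are hypothetical (they are not part of PEFT or LeRobot).

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix W (d_out x d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in a LoRA update W + B @ A, with A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out = 960, 960
for r in (8, 64, 480):
    ratio = lora_param_count(d_in, d_out, r) / full_param_count(d_in, d_out)
    print(f"r={r:>3}: {ratio:.1%} of the full matrix")
```

For a square layer, the adapter reaches the parameter count of the full matrix once `r` is half the layer width, so very large ranks give up most of the memory benefit.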
|
||||
|
||||
There are more complex methods that have more parameters. These are not yet supported; feel free to raise an issue
|
||||
if you want to see a specific PEFT method supported.
|
||||
|
||||
By default, PEFT will target the `q_proj` and `v_proj` layers of the LM expert in SmolVLA. It will also target the
|
||||
state and action projection matrices as they are most likely task-dependent. If you need to target different layers
|
||||
you can use `--peft.target_modules` to specify which layers to target. You can refer to the respective PEFT method's
|
||||
documentation to see what inputs are supported (e.g., [LoRA's target_modules documentation](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig.target_modules)).
|
||||
Usually, a list of suffixes or a regex is supported. For example, to target the MLPs of the `lm_expert` instead of
|
||||
the `q` and `v` projections, use:
|
||||
|
||||
```
|
||||
--peft.target_modules='(model\.vlm_with_expert\.lm_expert\..*\.(down|gate|up)_proj|.*\.(state_proj|action_in_proj|action_out_proj|action_time_mlp_in|action_time_mlp_out))'
|
||||
```
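
Before launching a long training run, it can help to check which module names a pattern would select. The sketch below assumes the string is treated as a regular expression matched against the full module name (`re.fullmatch` semantics). The module names are made-up examples for illustration; the real ones can be listed with `named_modules()` on the loaded policy.

```python
import re

# Pattern from the command above: MLP projections of the LM expert plus the
# state/action projection layers.
PATTERN = (
    r"(model\.vlm_with_expert\.lm_expert\..*\.(down|gate|up)_proj"
    r"|.*\.(state_proj|action_in_proj|action_out_proj|action_time_mlp_in|action_time_mlp_out))"
)

# Hypothetical module names, for illustration only.
candidates = [
    "model.vlm_with_expert.lm_expert.layers.0.mlp.down_proj",
    "model.vlm_with_expert.lm_expert.layers.0.self_attn.q_proj",
    "model.vlm_with_expert.vlm.layers.0.mlp.up_proj",
    "model.state_proj",
]

matched = [name for name in candidates if re.fullmatch(PATTERN, name)]
for name in candidates:
    print(f"{'adapt' if name in matched else 'skip':5}  {name}")
```

Because the whole name must match, the `up_proj` layer outside `lm_expert` is skipped even though its suffix looks similar.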
|
||||
|
||||
In case you need to fully fine-tune a layer instead of just adapting it, you can supply a list of layer suffixes
|
||||
to the `--peft.full_training_modules` parameter:
|
||||
|
||||
```
|
||||
--peft.full_training_modules=["state_proj"]
|
||||
```
|
||||
|
||||
The learning rate and the scheduled target learning rate can usually be scaled by a factor of 10 compared to the
|
||||
learning rate used for full fine-tuning (e.g., 1e-4 normal, so 1e-3 using LoRA).
|
||||
@@ -44,7 +44,7 @@ Modify the examples to use `PhoneOS.IOS` or `PhoneOS.ANDROID` in `PhoneConfig`.
|
||||
|
||||
Teleoperation example:
|
||||
|
||||
```python
|
||||
```36:43:examples/phone_so100_teleop.py
|
||||
from lerobot.teleoperators.phone.config_phone import PhoneConfig, PhoneOS
|
||||
|
||||
teleop_config = PhoneConfig(phone_os=PhoneOS.IOS) # or PhoneOS.ANDROID
|
||||
@@ -103,7 +103,7 @@ Additionally you can customize mapping or safety limits by editing the processor
|
||||
|
||||
- Kinematics are used in multiple steps. We use [Placo](https://github.com/Rhoban/placo), a wrapper around Pinocchio, to handle kinematics. We construct the kinematics object by passing the robot's URDF and target frame, and we set `target_frame_name` to the gripper frame.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/teleoperate.py
|
||||
kinematics_solver = RobotKinematics(
|
||||
urdf_path="./SO101/so101_new_calib.urdf",
|
||||
target_frame_name="gripper_frame_link",
|
||||
@@ -114,7 +114,7 @@ Additionally you can customize mapping or safety limits by editing the processor
|
||||
|
||||
- The `MapPhoneActionToRobotAction` step converts the calibrated phone pose and inputs into target deltas and gripper commands. The step's output is shown below.
|
||||
|
||||
```python
|
||||
```src/lerobot/teleoperators/phone/phone_processor.py
|
||||
action["enabled"] = enabled
|
||||
action["target_x"] = -pos[1] if enabled else 0.0
|
||||
action["target_y"] = pos[0] if enabled else 0.0
|
||||
@@ -127,7 +127,7 @@ Additionally you can customize mapping or safety limits by editing the processor
|
||||
|
||||
- The `EEReferenceAndDelta` step converts target deltas to an absolute desired EE pose, storing a reference when enabled. The `end_effector_step_sizes` are the step sizes for the EE pose and can be modified to change the motion speed.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/teleoperate.py
|
||||
EEReferenceAndDelta(
|
||||
kinematics=kinematics_solver,
|
||||
end_effector_step_sizes={"x": 0.5, "y": 0.5, "z": 0.5},
|
||||
@@ -138,7 +138,7 @@ Additionally you can customize mapping or safety limits by editing the processor
|
||||
|
||||
- The `EEBoundsAndSafety` step clamps EE motion to a workspace and checks for large EE step jumps to ensure safety. The `end_effector_bounds` are the bounds for the EE pose and can be modified to change the workspace. The `max_ee_step_m` value is the step limit for the EE pose and can be modified to change the safety limit.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/teleoperate.py
|
||||
EEBoundsAndSafety(
|
||||
end_effector_bounds={"min": [-1.0, -1.0, -1.0], "max": [1.0, 1.0, 1.0]},
|
||||
max_ee_step_m=0.10,
|
||||
@@ -147,7 +147,7 @@ Additionally you can customize mapping or safety limits by editing the processor
|
||||
|
||||
- The `GripperVelocityToJoint` step turns a velocity‑like gripper input into absolute gripper position using the current measured state. The `speed_factor` is the factor by which the velocity is multiplied.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/teleoperate.py
|
||||
GripperVelocityToJoint(speed_factor=20.0)
|
||||
```
|
||||
|
||||
@@ -157,7 +157,7 @@ We use different IK initial guesses in the kinematic steps. As initial guess eit
|
||||
|
||||
- Closed loop (used in record/eval): sets `initial_guess_current_joints=True` so IK starts from the measured joints each frame.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/record.py
|
||||
InverseKinematicsEEToJoints(
|
||||
kinematics=kinematics_solver,
|
||||
motor_names=list(robot.bus.motors.keys()),
|
||||
@@ -167,7 +167,7 @@ We use different IK initial guesses in the kinematic steps. As initial guess eit
|
||||
|
||||
- Open loop (used in replay): sets `initial_guess_current_joints=False` so IK continues from the previous IK solution rather than the measured state. This preserves action stability when we replay without feedback.
|
||||
|
||||
```python
|
||||
```examples/phone_to_so100/replay.py
|
||||
InverseKinematicsEEToJoints(
|
||||
kinematics=kinematics_solver,
|
||||
motor_names=list(robot.bus.motors.keys()),
|
||||
|
||||
@@ -6,12 +6,6 @@
|
||||
|
||||
π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robot programs that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pi0%20(1).png"
|
||||
alt="An overview of Pi0"
|
||||
width="85%"
|
||||
/>
|
||||
|
||||
### The Vision for Physical Intelligence
|
||||
|
||||
As described by Physical Intelligence, while AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess represents an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. π₀ represents a first step toward developing artificial physical intelligence that enables users to simply ask robots to perform any task they want, just like they can with large language models.
|
||||
@@ -70,8 +64,6 @@ python src/lerobot/scripts/lerobot_train.py \
|
||||
--policy.compile_model=true \
|
||||
--policy.gradient_checkpointing=true \
|
||||
--policy.dtype=bfloat16 \
|
||||
--policy.freeze_vision_encoder=false \
|
||||
--policy.train_expert_only=false \
|
||||
--steps=3000 \
|
||||
--policy.device=cuda \
|
||||
--batch_size=32
|
||||
@@ -87,15 +79,6 @@ python src/lerobot/scripts/lerobot_train.py \
|
||||
- [lerobot/pi0_base](https://huggingface.co/lerobot/pi0_base)
|
||||
- [lerobot/pi0_libero](https://huggingface.co/lerobot/pi0_libero) (specifically trained on the Libero dataset)
|
||||
|
||||
### Training Parameters Explained
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| ----------------------- | ------- | ------------------------------------------- |
|
||||
| `freeze_vision_encoder` | `false` | Do not freeze the vision encoder |
|
||||
| `train_expert_only` | `false` | Do not freeze the VLM, train all parameters |
|
||||
|
||||
**💡 Tip**: Setting `train_expert_only=true` freezes the VLM and trains only the action expert and projections, allowing finetuning with reduced memory usage.
|
||||
|
||||
## License
|
||||
|
||||
This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
|
||||
|
||||
@@ -67,8 +67,6 @@ python src/lerobot/scripts/lerobot_train.py\
|
||||
--policy.gradient_checkpointing=true \
|
||||
--wandb.enable=true \
|
||||
--policy.dtype=bfloat16 \
|
||||
--policy.freeze_vision_encoder=false \
|
||||
--policy.train_expert_only=false \
|
||||
--steps=3000 \
|
||||
--policy.device=cuda \
|
||||
--batch_size=32
|
||||
@@ -84,15 +82,6 @@ python src/lerobot/scripts/lerobot_train.py\
|
||||
- [lerobot/pi05_base](https://huggingface.co/lerobot/pi05_base)
|
||||
- [lerobot/pi05_libero](https://huggingface.co/lerobot/pi05_libero) (specifically trained on the Libero dataset)
|
||||
|
||||
### Training Parameters Explained
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| ----------------------- | ------- | ------------------------------------------- |
|
||||
| `freeze_vision_encoder` | `false` | Do not freeze the vision encoder |
|
||||
| `train_expert_only` | `false` | Do not freeze the VLM, train all parameters |
|
||||
|
||||
**💡 Tip**: Setting `train_expert_only=true` freezes the VLM and trains only the action expert and projections, allowing finetuning with reduced memory usage.
|
||||
|
||||
If your dataset is not converted with `quantiles`, you can convert it with the following command:
|
||||
|
||||
```bash
|
||||
|
||||
@@ -1,246 +0,0 @@
|
||||
# π₀-FAST (Pi0-FAST)
|
||||
|
||||
π₀-FAST is a **Vision-Language-Action model for general robot control** that uses autoregressive next-token prediction to model continuous robot actions.
|
||||
|
||||
## Model Overview
|
||||
|
||||
π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pifast.png"
|
||||
alt="An overview of Pi0-FAST"
|
||||
width="85%"
|
||||
/>
|
||||
|
||||
### Why FAST?
|
||||
|
||||
Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.
|
||||
|
||||
FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively—just like language tokens.
|
||||
|
||||
### How FAST Tokenization Works
|
||||
|
||||
The FAST tokenizer compresses action sequences through the following steps:
|
||||
|
||||
1. **Normalize**: Take a continuous action chunk of shape `(H, D)` where `H` is the horizon and `D` is the action dimension. Normalize using one of the supported normalization methods (Quantiles recommended to handle outliers).
|
||||
|
||||
2. **Discrete Cosine Transform (DCT)**: Apply DCT (via scipy) to each action dimension separately. The DCT is a frequency-domain transform widely used in lossy image and audio compression (e.g., JPEG, MP3).
|
||||
|
||||
3. **Quantization**: Round and remove insignificant coefficients for each action dimension, producing a sparse frequency matrix.
|
||||
|
||||
4. **Flatten**: Flatten the matrix into a 1D vector, with low-frequency components first.
|
||||
|
||||
5. **Byte Pair Encoding (BPE)**: Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving **10x compression** over prior tokenization approaches.
|
||||
|
||||
This approach can transform **any existing VLM** into a VLA by training it to predict these FAST tokens.
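
Steps 2 to 4 above can be sketched in a few lines. This is a simplified illustration, not the actual FAST tokenizer: it uses a pure-Python orthonormal DCT-II instead of scipy, assumes the chunk is already normalized, assumes `scale` is applied by multiplying coefficients before rounding, and leaves out the BPE stage entirely. The function names are hypothetical.

```python
import math


def dct_ii(signal: list[float]) -> list[float]:
    """Orthonormal DCT-II of a 1D signal (pure Python, for clarity rather than speed)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(norm * total)
    return coeffs


def fast_style_encode(chunk: list[list[float]], scale: float = 10.0) -> list[int]:
    """Steps 2-4 on an already-normalized chunk of shape (H, D)."""
    horizon, dims = len(chunk), len(chunk[0])
    # Step 2: DCT along time, separately for each action dimension.
    per_dim = [dct_ii([chunk[t][d] for t in range(horizon)]) for d in range(dims)]
    # Step 3: scale and round; small coefficients collapse to zero.
    quantized = [[round(scale * c) for c in coeffs] for coeffs in per_dim]
    # Step 4: flatten with low frequencies first (frequency-major order).
    return [quantized[d][k] for k in range(horizon) for d in range(dims)]


# A smooth, hypothetical 2-D action chunk with horizon 8:
# dimension 0 is a slow ramp, dimension 1 is constant.
chunk = [[0.1 * t, 0.5] for t in range(8)]
tokens = fast_style_encode(chunk)
print(tokens)
```

For smooth trajectories, most of the energy lands in the first few coefficients and the rest round to zero, which is what makes the flattened sequence easy for BPE to compress.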
|
||||
|
||||
## Installation Requirements
|
||||
|
||||
1. Install LeRobot by following our [Installation Guide](./installation).
|
||||
2. Install π₀-FAST dependencies by running:
|
||||
|
||||
```bash
|
||||
pip install -e ".[pi]"
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> For lerobot 0.4.0, if you want to install the pi tag, you will have to do: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
|
||||
>
|
||||
> This will be solved in the next patch release.
|
||||
|
||||
## Training a Custom FAST Tokenizer
|
||||
|
||||
You have two options for the FAST tokenizer:
|
||||
|
||||
1. **Use the pre-trained tokenizer**: The `physical-intelligence/fast` tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
|
||||
|
||||
2. **Train your own tokenizer**: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.
|
||||
|
||||
### Training Your Own Tokenizer
|
||||
|
||||
```bash
|
||||
lerobot-train-tokenizer \
|
||||
--repo_id "user/my-lerobot-dataset" \
|
||||
--action_horizon 10 \
|
||||
--encoded_dims "0:6" \
|
||||
--vocab_size 1024 \
|
||||
--scale 10.0 \
|
||||
--normalization_mode QUANTILES \
|
||||
--output_dir "./my_fast_tokenizer" \
|
||||
--push_to_hub \
|
||||
--hub_repo_id "username/my-action-tokenizer"
|
||||
```
|
||||
|
||||
### Key Tokenizer Parameters
|
||||
|
||||
| Parameter | Description | Default |
|
||||
| ---------------------- | --------------------------------------------------------------------------------- | ------------ |
|
||||
| `--repo_id` | LeRobot dataset repository ID | Required |
|
||||
| `--action_horizon` | Number of future actions in each chunk | `10` |
|
||||
| `--encoded_dims` | Comma-separated dimension ranges to encode (e.g., `"0:6,7:23"`) | `"0:6,7:23"` |
|
||||
| `--vocab_size` | BPE vocabulary size | `1024` |
|
||||
| `--scale` | DCT scaling factor for quantization | `10.0` |
|
||||
| `--normalization_mode` | Normalization mode (`MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`, `IDENTITY`) | `QUANTILES` |
|
||||
| `--sample_fraction` | Fraction of chunks to sample per episode | `0.1` |
|
||||
|
||||
## Usage
|
||||
|
||||
To use π₀-FAST in LeRobot, specify the policy type as:
|
||||
|
||||
```bash
|
||||
policy.type=pi0_fast
|
||||
```
|
||||
|
||||
## Training
|
||||
|
||||
For training π₀-FAST, you can use the LeRobot training script:
|
||||
|
||||
```bash
|
||||
lerobot-train \
|
||||
--dataset.repo_id=your_dataset \
|
||||
--policy.type=pi0_fast \
|
||||
--output_dir=./outputs/pi0fast_training \
|
||||
--job_name=pi0fast_training \
|
||||
--policy.pretrained_path=lerobot/pi0_fast_base \
|
||||
--policy.dtype=bfloat16 \
|
||||
--policy.gradient_checkpointing=true \
|
||||
--policy.chunk_size=10 \
|
||||
--policy.n_action_steps=10 \
|
||||
--policy.max_action_tokens=256 \
|
||||
--steps=100000 \
|
||||
--batch_size=4 \
|
||||
--policy.device=cuda
|
||||
```
|
||||
|
||||
### Key Training Parameters
|
||||
|
||||
| Parameter | Description | Default |
|
||||
| -------------------------------------- | -------------------------------------------------- | ---------------------------- |
|
||||
| `--policy.gradient_checkpointing=true` | Reduces memory usage significantly during training | `false` |
|
||||
| `--policy.dtype=bfloat16` | Use mixed precision training for efficiency | `float32` |
|
||||
| `--policy.chunk_size` | Number of action steps to predict (action horizon) | `50` |
|
||||
| `--policy.n_action_steps` | Number of action steps to execute | `50` |
|
||||
| `--policy.max_action_tokens` | Maximum number of FAST tokens per action chunk | `256` |
|
||||
| `--policy.action_tokenizer_name` | FAST tokenizer to use | `physical-intelligence/fast` |
|
||||
| `--policy.compile_model=true` | Enable torch.compile for faster training | `false` |
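
The relationship between `chunk_size` and `n_action_steps` can be illustrated with a toy control loop: the policy predicts `chunk_size` actions per call, but only the first `n_action_steps` are executed before it is queried again. The sketch uses a stand-in `fake_policy` function and is not LeRobot's actual implementation.

```python
from collections import deque


def fake_policy(step: int, chunk_size: int) -> list[int]:
    """Stand-in for a policy call: returns `chunk_size` placeholder actions."""
    return [step + i for i in range(chunk_size)]


def run_episode(total_steps: int, chunk_size: int, n_action_steps: int) -> int:
    """Run a toy loop and return how many times the policy was queried."""
    assert n_action_steps <= chunk_size
    queue = deque()
    policy_calls = 0
    for step in range(total_steps):
        if not queue:
            # Queue is empty: predict a new chunk, keep only the first n_action_steps.
            queue.extend(fake_policy(step, chunk_size)[:n_action_steps])
            policy_calls += 1
        _action = queue.popleft()  # would be sent to the robot here
    return policy_calls


print(run_episode(total_steps=100, chunk_size=10, n_action_steps=10))
print(run_episode(total_steps=100, chunk_size=50, n_action_steps=25))
```

Executing fewer steps per chunk means more frequent re-planning at the cost of more policy calls, which matters for an autoregressive model where each call decodes many tokens.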
|
||||
|
||||
## Inference
|
||||
|
||||
### KV-Caching for Fast Inference
|
||||
|
||||
π₀-FAST supports **KV-caching**, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.
|
||||
|
||||
```bash
|
||||
# KV-caching is enabled by default
|
||||
policy.use_kv_cache=true
|
||||
```
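
To see why caching helps, the toy count below compares how many per-token key/value projections are computed when decoding `T` tokens with and without a cache. It models only this bookkeeping, assumes one projection per token per step, and ignores the prefix (images, prompt) and everything else a real transformer does.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Each decoding step recomputes keys/values for every token generated so far."""
    return sum(step for step in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """Each decoding step computes keys/values only for the newest token."""
    return num_tokens


T = 256  # same value as the `max_decoding_steps` default listed in this document
print(kv_projections_without_cache(T), kv_projections_with_cache(T))
```

The uncached cost grows quadratically with the number of decoded tokens, while the cached cost grows linearly.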
|
||||
|
||||
### Inference Example
|
||||
|
||||
```python
|
||||
from lerobot.policies.pi0_fast import PI0FastPolicy, PI0FastConfig
|
||||
|
||||
# Load the policy
|
||||
policy = PI0FastPolicy.from_pretrained("your-model-path")
|
||||
|
||||
# During inference
|
||||
actions = policy.predict_action_chunk(batch)
|
||||
```
|
||||
|
||||
## Model Architecture
|
||||
|
||||
π₀-FAST uses a PaliGemma-based architecture:
|
||||
|
||||
- **Vision Encoder**: SigLIP vision tower for image understanding
|
||||
- **Language Model**: Gemma 2B for processing language instructions and predicting action tokens
|
||||
|
||||
The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back to continuous actions.
|
||||
|
||||
## Configuration Options
|
||||
|
||||
| Parameter | Description | Default |
|
||||
| -------------------- | ----------------------------------------------- | ---------- |
|
||||
| `paligemma_variant` | VLM backbone variant (`gemma_300m`, `gemma_2b`) | `gemma_2b` |
|
||||
| `max_state_dim` | Maximum state vector dimension (padded) | `32` |
|
||||
| `max_action_dim` | Maximum action vector dimension (padded) | `32` |
|
||||
| `temperature` | Sampling temperature (0.0 for greedy) | `0.0` |
|
||||
| `max_decoding_steps` | Maximum decoding steps | `256` |
|
||||
| `use_kv_cache` | Enable KV caching for faster inference | `true` |
|
||||
|
||||
## Comparison with π₀
|
||||
|
||||
| Feature | π₀ | π₀-FAST |
|
||||
| --------------------- | ------------------------- | ---------------------------- |
|
||||
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
|
||||
| Training Speed | 1x | **5x faster** |
|
||||
| Dexterity | High | High |
|
||||
| Inference Method | Iterative Denoising | Autoregressive Decoding |
|
||||
| KV-Caching | N/A | Supported |
|
||||
|
||||
## Reproducing π₀Fast results
|
||||
|
||||
We reproduce the results of π₀Fast on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀Fast base model [lerobot/pi0fast-base](https://huggingface.co/lerobot/pi0fast-base) and finetune it for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs, using the [HuggingFace LIBERO dataset](https://huggingface.co/datasets/HuggingFaceVLA/libero).
|
||||
|
||||
The finetuned model can be found here:
|
||||
|
||||
- **π₀Fast LIBERO**: [lerobot/pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)
|
||||
|
||||
With the following training command:
|
||||
|
||||
```bash
|
||||
lerobot-train \
|
||||
--dataset.repo_id=lerobot/libero \
|
||||
--output_dir=outputs/libero_pi0fast \
|
||||
--job_name=libero_pi0fast \
|
||||
--policy.path=lerobot/pi0fast_base \
|
||||
--policy.dtype=bfloat16 \
|
||||
--steps=100000 \
|
||||
--save_freq=20000 \
|
||||
--batch_size=4 \
|
||||
--policy.device=cuda \
|
||||
--policy.scheduler_warmup_steps=4000 \
|
||||
--policy.scheduler_decay_steps=100000 \
|
||||
--policy.scheduler_decay_lr=1e-5 \
|
||||
--policy.gradient_checkpointing=true \
|
||||
--policy.chunk_size=10 \
|
||||
--policy.n_action_steps=10 \
|
||||
--policy.max_action_tokens=256 \
|
||||
--policy.empty_cameras=1
|
||||
```
|
||||
|
||||
We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
|
||||
|
||||
```bash
|
||||
tasks="libero_object,libero_spatial,libero_goal,libero_10"
|
||||
lerobot-eval \
|
||||
--policy.path=lerobot/pi0fast-libero \
|
||||
--policy.max_action_tokens=256 \
|
||||
--env.type=libero \
|
||||
--policy.gradient_checkpointing=false \
|
||||
--env.task=${tasks} \
|
||||
--eval.batch_size=1 \
|
||||
--eval.n_episodes=1 \
|
||||
--rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
|
||||
```
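
The `--rename_map` argument maps the environment's observation keys to the names the policy was trained with. The sketch below shows the effect of such a mapping on a plain dictionary; `rename_keys` and the placeholder observation values are hypothetical, and how `lerobot-eval` applies the map internally is not shown here.

```python
import json


def rename_keys(observation, rename_map):
    """Return a copy of `observation` with keys renamed; unknown keys are kept."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


# Same JSON string as passed on the command line.
rename_map = json.loads(
    '{"observation.images.image":"observation.images.base_0_rgb",'
    '"observation.images.image2":"observation.images.left_wrist_0_rgb"}'
)

# Placeholder values stand in for real image tensors and state vectors.
observation = {
    "observation.images.image": "agent_view_frame",
    "observation.images.image2": "wrist_frame",
    "observation.state": [0.0] * 8,
}

renamed = rename_keys(observation, rename_map)
assert sorted(renamed) == [
    "observation.images.base_0_rgb",
    "observation.images.left_wrist_0_rgb",
    "observation.state",
]
```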
|
||||
|
||||
**Note:** We set `n_action_steps=10`, similar to the original OpenPI implementation.
|
||||
|
||||
### Results
|
||||
|
||||
We obtain the following results on the LIBERO benchmark:
|
||||
|
||||
| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
|
||||
| ----------- | -------------- | ------------- | ----------- | --------- | -------- |
|
||||
| **π₀-fast** | 70.0 | 100.0 | 100.0 | 60.0 | **82.5** |
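
The reported average is the unweighted mean of the four suite scores, which can be checked directly:

```python
# Success rates (%) copied from the table above.
success_rates = {
    "LIBERO Spatial": 70.0,
    "LIBERO Object": 100.0,
    "LIBERO Goal": 100.0,
    "LIBERO 10": 60.0,
}

average = sum(success_rates.values()) / len(success_rates)
print(average)  # 82.5
```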
|
||||
|
||||
The full evaluation output folder, including videos, is available [here](https://drive.google.com/drive/folders/1HXpwPTRm4hx6g1sF2P7OOqGG0TwPU7LQ?usp=sharing).
|
||||
|
||||
## License
|
||||
|
||||
This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
|
||||
|
||||
## References
|
||||
|
||||
- [FAST: Efficient Robot Action Tokenization](https://www.physicalintelligence.company/research/fast) - Physical Intelligence Blog
|
||||
- [OpenPI Repository](https://github.com/Physical-Intelligence/openpi) - Original implementation
|
||||
- [FAST Tokenizer on Hugging Face](https://huggingface.co/physical-intelligence/fast) - Pre-trained tokenizer
|
||||
@@ -1,45 +0,0 @@
|
||||
# WALL-OSS
|
||||
|
||||
This repository contains the Hugging Face port of [**WALL-OSS**](https://x2robot.com/en/research/68bc2cde8497d7f238dde690), a Vision-Language-Action model for cross-embodiment robotic control based on Qwen2.5-VL with flow matching/FAST action prediction.
|
||||
|
||||
---
|
||||
|
||||
## Model Overview
|
||||
|
||||
| Feature | Description |
|
||||
| ------------------ | ----------------------------------------------------- |
|
||||
| Base Model | Qwen2.5-VL (Vision-Language Model) |
|
||||
| Action Prediction | Flow Matching (diffusion) or FAST (discrete tokens) |
|
||||
| Architecture | Mixture of Experts (MoE) with action-specific routing |
|
||||
| Multi-Modal Inputs | Vision (images/videos), Language, Proprioception |
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
Paper: https://arxiv.org/pdf/2509.11766
|
||||
|
||||
Official Repository: https://github.com/X-Square-Robot/wall-x
|
||||
|
||||
Hugging Face: https://huggingface.co/x-square-robot
|
||||
|
||||
---
|
||||
|
||||
## Citation
|
||||
|
||||
If you use this work, please cite:
|
||||
|
||||
```bibtex
|
||||
@article{zhai2025igniting,
|
||||
title = {Igniting VLMs Toward the Embodied Space},
|
||||
author = {Zhai, Andy and Liu, Brae and Fang, Bruno and Cai, Chalse and Ma, Ellie and Yin, Ethan and Wang, Hao and Zhou, Hugo and Wang, James and Shi, Lights and Liang, Lucy and Wang, Make and Wang, Qian and Gan, Roy and Yu, Ryan and Li, Shalfun and Liu, Starrick and Chen, Sylas and Chen, Vincent and Xu, Zach},
|
||||
journal = {arXiv preprint arXiv:2509.11766},
|
||||
year = {2025}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
This model follows the **Apache 2.0 License**, consistent with the original [WallX repository](https://github.com/X-Square-Robot/wall-x).
|
||||
@@ -30,7 +30,7 @@ Each of these pipelines handle different conversions between different action an
|
||||
|
||||
Below is an example of the three pipelines that we use in the phone to SO-100 follower examples:
|
||||
|
||||
```python
|
||||
```69:90:examples/phone_so100_record.py
|
||||
phone_to_robot_ee_pose_processor = RobotProcessorPipeline[RobotAction, RobotAction]( # teleop -> dataset action
|
||||
steps=[
|
||||
MapPhoneActionToRobotAction(platform=teleop_config.phone_os),
|
||||
@@ -84,7 +84,7 @@ Dataset features are determined by the keys saved in the dataset. Each step can
|
||||
|
||||
Below is an example of how we declare features with the `transform_features` method in the phone to SO-100 follower examples:
|
||||
|
||||
```python
|
||||
```src/lerobot/robots/so100_follower/robot_kinematic_processor.py
|
||||
def transform_features(
|
||||
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
|
||||
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
|
||||
@@ -103,7 +103,7 @@ Here we declare what PolicyFeatures we modify in this step, so we know what feat
|
||||
|
||||
Below is an example of how we aggregate and merge features in the phone to SO-100 record example:
|
||||
|
||||
```python
|
||||
```121:145:examples/phone_so100_record.py
|
||||
features=combine_feature_dicts(
|
||||
# Run the feature contract of the pipelines
|
||||
# This tells you how the features would look like after the pipeline steps
|
||||
|
||||
@@ -38,7 +38,6 @@ docker run --rm -it \
|
||||
start_rviz:=true start_sdk_server:=true mujoco:=true
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> If MuJoCo runs slowly (low simulation frequency), append `-e LD_LIBRARY_PATH="/opt/host-libs:$LD_LIBRARY_PATH" \` to the previous command to improve performance:
|
||||
>
|
||||
> ```
|
||||
@@ -142,7 +141,7 @@ If you choose this option but still want to use the VR teleoperation application
|
||||
First add reachy2 and reachy2_teleoperator to the imports of the record script. Then you can use the following command:
|
||||
|
||||
```bash
|
||||
lerobot-record \
|
||||
python -m lerobot.record \
|
||||
--robot.type=reachy2 \
|
||||
--robot.ip_address=192.168.0.200 \
|
||||
--robot.id=r2-0000 \
|
||||
@@ -151,7 +150,6 @@ lerobot-record \
|
||||
--teleop.type=reachy2_teleoperator \
|
||||
--teleop.ip_address=192.168.0.200 \
|
||||
--teleop.with_mobile_base=false \
|
||||
--robot.with_torso_camera=true \
|
||||
--dataset.repo_id=pollen_robotics/record_test \
|
||||
--dataset.single_task="Reachy 2 recording test" \
|
||||
--dataset.num_episodes=1 \
|
||||
@@ -167,7 +165,7 @@ lerobot-record \
|
||||
**Extended setup overview (all options included):**
|
||||
|
||||
```bash
|
||||
lerobot-record \
|
||||
python -m lerobot.record \
|
||||
--robot.type=reachy2 \
|
||||
--robot.ip_address=192.168.0.200 \
|
||||
--robot.use_external_commands=true \
|
||||
@@ -179,8 +177,6 @@ lerobot-record \
|
||||
--robot.with_left_teleop_camera=true \
|
||||
--robot.with_right_teleop_camera=true \
|
||||
--robot.with_torso_camera=false \
|
||||
--robot.camera_width=640 \
|
||||
--robot.camera_height=480 \
|
||||
--robot.disable_torque_on_disconnect=false \
|
||||
--robot.max_relative_target=5.0 \
|
||||
--teleop.type=reachy2_teleoperator \
|
||||
@@ -216,10 +212,9 @@ Must be set to true if a compliant Reachy 2 is used to control another one.
|
||||
From our initial tests, recording **all** joints when only some are moving can reduce model quality with certain policies.
|
||||
To avoid this, you can exclude specific parts from recording and replay using:
|
||||
|
||||
```bash
|
||||
````
|
||||
--robot.with_<part>=false
|
||||
```
|
||||
|
||||
```,
|
||||
with `<part>` being one of: `mobile_base`, `l_arm`, `r_arm`, `neck`, `antennas`.
|
||||
It determines whether the corresponding part is recorded in the observations. Defaults to `true` if not set.
|
||||
|
||||
@@ -227,60 +222,49 @@ By default, **all parts are recorded**.
|
||||
|
||||
The same per-part mechanism is available in `reachy2_teleoperator` as well.
|
||||
|
||||
```bash
|
||||
--teleop.with_<part>
|
||||
```
|
||||
````
|
||||
|
||||
--teleop.with_<part>
|
||||
|
||||
```
|
||||
with `<part>` being one of: `mobile_base`, `l_arm`, `r_arm`, `neck`, `antennas`.
|
||||
It determines whether the corresponding part is recorded in the actions. Defaults to `true` if not set.
|
||||
|
||||
> **Important:** In a given session, the **enabled parts must match** on both the robot and the teleoperator.
|
||||
> For example, if the robot runs with `--robot.with_mobile_base=false`, the teleoperator must disable the same part with `--teleop.with_mobile_base=false`.
|
||||
For example, if the robot runs with `--robot.with_mobile_base=false`, the teleoperator must disable the same part with `--teleop.with_mobile_base=false`.
|
||||
|
||||
##### Use the relevant cameras
|
||||
|
||||
You can do the same for **cameras**. Enable or disable each camera with default parameters using:
|
||||
You can do the same for **cameras**. By default, only the **teleoperation cameras** are recorded (both `left_teleop_camera` and `right_teleop_camera`). Enable or disable each camera with:
|
||||
|
||||
```bash
|
||||
--robot.with_left_teleop_camera=<true|false> \
|
||||
--robot.with_right_teleop_camera=<true|false> \
|
||||
```
|
||||
|
||||
--robot.with_left_teleop_camera=<true|false>
|
||||
--robot.with_right_teleop_camera=<true|false>
|
||||
--robot.with_torso_camera=<true|false>
|
||||
```
|
||||
|
||||
By default, no camera is recorded, all camera arguments are set to `false`.
|
||||
If you want, you can set a custom `width` and `height` for Reachy 2's cameras with the `--robot.camera_width` and `--robot.camera_height` arguments:
|
||||
````
|
||||
|
||||
```bash
|
||||
--robot.camera_width=1920 \
|
||||
--robot.camera_height=1080
|
||||
```
|
||||
|
||||
This will change the resolution of all 3 default robot cameras (enabled by the above bool arguments).
|
||||
|
||||
You can also add cameras beyond the robot's built-in ones in the usual way:
|
||||
|
||||
```bash
|
||||
--robot.cameras="{ extra: {type: opencv, index_or_path: 42, width: 640, height: 480, fps: 30}}" \
|
||||
```
|
||||
|
||||
## Step 2: Replay
|
||||
|
||||
Make sure the robot is configured with the same parts as the dataset:
|
||||
|
||||
```bash
|
||||
lerobot-replay \
|
||||
python -m lerobot.replay \
|
||||
--robot.type=reachy2 \
|
||||
--robot.ip_address=192.168.0.200 \
|
||||
--robot.use_external_commands=false \
|
||||
--robot.with_mobile_base=false \
|
||||
--dataset.repo_id=pollen_robotics/record_test \
|
||||
--dataset.episode=0
|
||||
```
|
||||
--display_data=true
|
||||
````
|
||||
|
||||
## Step 3: Train
|
||||
|
||||
```bash
|
||||
lerobot-train \
|
||||
python -m lerobot.scripts.train \
|
||||
--dataset.repo_id=pollen_robotics/record_test \
|
||||
--policy.type=act \
|
||||
--output_dir=outputs/train/reachy2_test \
|
||||
@@ -293,9 +277,10 @@ lerobot-train \
|
||||
## Step 4: Evaluate
|
||||
|
||||
```bash
|
||||
lerobot-eval \
|
||||
python -m lerobot.record \
|
||||
--robot.type=reachy2 \
|
||||
--robot.ip_address=192.168.0.200 \
|
||||
--display_data=false \
|
||||
--dataset.repo_id=pollen_robotics/eval_record_test \
|
||||
--dataset.single_task="Evaluate reachy2 policy" \
|
||||
--dataset.num_episodes=10 \
|
||||
|
||||
@@ -1,592 +0,0 @@
|
||||
# SARM: Stage-Aware Reward Modeling
|
||||
|
||||
SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework for long-horizon robot manipulation tasks. This guide covers how to train SARM reward models and optionally use them with Reward-Aligned Behavior Cloning (RA-BC).
|
||||
|
||||
**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-sarm.png"
|
||||
alt="An overview of SARM"
|
||||
width="80%"
|
||||
/>
|
||||
|
||||
## Why Reward Models?
|
||||
|
||||
Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy. They contain hesitations, corrections, and variable-quality trajectories. Reward models address this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways. Two promising applications are: (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
|
||||
|
||||
## Overview
|
||||
|
||||
SARM has the following features:
|
||||
|
||||
1. **Stage-aware architecture**: Jointly predicts the high-level task stage and fine-grained progress within each stage
|
||||
2. **Subtask annotations**: Uses natural language subtask annotations to derive consistent progress labels
|
||||
3. **Temporal proportions**: Computes dataset-level priors (α̅\_k) for each subtask to normalize progress across variable-length demonstrations
|
||||
|
||||
SARM trains on a compact **stage+tau** target for each frame:
|
||||
|
||||
- **stage**: integer stage index `k ∈ {0, ..., K-1}`
|
||||
- **τ (tau)**: within-stage progress `τ ∈ [0, 1]`
|
||||
- **target encoding**: `y = k + τ` (this is what the dataset processor produces)
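
A minimal sketch of this encoding, assuming 0-indexed stages. `encode_target` and `decode_target` are hypothetical helpers, not functions from the SARM processor; the clamp in `decode_target` is one way to handle the boundary case where `τ = 1` in the last stage.

```python
import math


def encode_target(stage, tau):
    """Pack the stage index k and within-stage progress tau into y = k + tau."""
    return stage + tau


def decode_target(y, num_stages):
    """Split y back into (stage, tau), clamping to the last valid stage."""
    stage = min(int(math.floor(y)), num_stages - 1)
    return stage, y - stage


# Stage 2, one quarter of the way through it.
y = encode_target(2, 0.25)
assert y == 2.25
assert decode_target(y, num_stages=4) == (2, 0.25)

# A fully completed final stage stays in the last stage with tau = 1.0.
assert decode_target(4.0, num_stages=4) == (3, 1.0)
```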
|
||||
|
||||
At inference time (and in downstream RA-BC), SARM converts the raw `k + τ` value into a **normalized progress** in `[0, 1]` using dataset-level **temporal proportions** `α̅_k` (stored in `meta/temporal_proportions_*.json`).
|
||||
|
||||
This matches **Formula (2)** from the paper:
|
||||
|
||||
```
|
||||
progress_t = P_{k-1} + α̅_k × τ_t
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
- `τ_t = (t - s_k) / (e_k - s_k)` is within-subtask normalized time
|
||||
- `P_{k-1}` is cumulative prior (sum of previous subtask proportions)
|
||||
- `α̅_k` is the temporal proportion for subtask k
|
||||
|
||||
This ensures identical task states map to consistent progress values, even across demonstrations of different lengths.
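
The formula can be sketched in a few lines of Python. The helper names and the example proportions are hypothetical, and stages are 0-indexed here, so the cumulative prior is the sum of the proportions of all stages before `k`.

```python
def within_stage_tau(t, start, end):
    """tau_t = (t - s_k) / (e_k - s_k), the normalized time inside a subtask."""
    if end == start:
        return 1.0  # degenerate single-frame subtask
    return (t - start) / (end - start)


def normalized_progress(stage, tau, proportions):
    """progress = (sum of proportions before `stage`) + proportions[stage] * tau."""
    prior = sum(proportions[:stage])
    return prior + proportions[stage] * tau


# Hypothetical dataset-level temporal proportions for three subtasks.
proportions = [0.2, 0.5, 0.3]

# Frame 30 lies in subtask 1, which spans frames 20..60 in this episode.
tau = within_stage_tau(30, start=20, end=60)
progress = normalized_progress(stage=1, tau=tau, proportions=proportions)

assert tau == 0.25
assert abs(progress - 0.325) < 1e-9
```

Because the proportions come from the whole dataset rather than from the individual episode, a slow and a fast demonstration of the same subtask map to the same progress range.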
|
||||
|
||||
## Inputs and Targets (What the new code expects)
|
||||
|
||||
SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:
|
||||
|
||||
- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
|
||||
- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
|
||||
- **Builds targets** as `sparse_targets` (and `dense_targets` in `dense_only`/`dual`) using the stage+tau encoding `y = k + τ`
|
||||
- **Masks rewind frames** using a per-sample `lengths` tensor (rewind is a training-time augmentation)
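
As a rough illustration of the pad/truncate step, the sketch below fixes a state vector to a given length. `pad_or_truncate`, the default dimension, and the choice of zero as the padding value are assumptions for the example, not the processor's actual code.

```python
def pad_or_truncate(state, max_state_dim=32, pad_value=0.0):
    """Return `state` cut or right-padded to exactly `max_state_dim` entries."""
    trimmed = list(state[:max_state_dim])
    return trimmed + [pad_value] * (max_state_dim - len(trimmed))


# A 6-DoF joint state is padded up to the fixed width.
padded = pad_or_truncate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], max_state_dim=8)
assert padded == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0, 0.0]

# A state that is too long is truncated.
assert pad_or_truncate(list(range(10)), max_state_dim=4) == [0, 1, 2, 3]
```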
|
||||
|
||||
At minimum, each training sample needs:
|
||||
|
||||
- `task` (string): task description
|
||||
- `policy.image_key` images and `policy.state_key` states from the dataset
|
||||
|
||||
---
|
||||
|
||||
## Annotation Modes
|
||||
|
||||
You can choose from **3 annotation modes** that determine how progress labels are computed:
|
||||
|
||||
| Mode | Annotations Required | Heads | Use Case |
|
||||
| -------------- | -------------------- | ---------------------------- | ------------------------------------------------------------ |
|
||||
| `single_stage` | None | Sparse only | Simple tasks, quick experiments, no VLM needed |
|
||||
| `dense_only` | Dense (VLM) | Dual (sparse auto-generated) | Detailed subtask tracking without defining high-level stages |
|
||||
| `dual` | Sparse + Dense (VLM) | Dual | Full SARM paper setup with both granularities |
|
||||
|
||||
### Mode Details
|
||||
|
||||
<hfoptions id="mode_explanation">
|
||||
<hfoption id="single_stage">
|
||||
|
||||
**No annotations required.** The entire episode is treated as a single stage called `"task"`, and progress is linear from 0 to 1 over the episode duration.
|
||||
|
||||
- **Sparse head**: 1 stage ("task"), linear progress
|
||||
- **Dense head**: Not used
|
||||
- **Best for**: Simple tasks, quick experiments, or when VLM annotation is not available
|
||||
|
||||
## Set Up Your Environment
|
||||
|
||||
1. Install LeRobot by following our [Installation Guide](./installation).
|
||||
2. Install SARM dependencies by running:
|
||||
|
||||
```bash
|
||||
pip install -e ".[sarm]"
|
||||
```
|
||||
|
||||
Workflow:
|
||||
|
||||
```
|
||||
1. Train SARM → 2. Visualize predictions → 3. (Optional) Train policy with RA-BC
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dense_only">
|
||||
|
||||
**Only dense (fine-grained) annotations from a VLM.** The sparse head automatically uses a single `"task"` stage covering the full episode, while the dense head learns detailed subtask progression.
|
||||
|
||||
- **Sparse head**: 1 stage ("task"), linear progress (auto-generated)
|
||||
- **Dense head**: Multiple fine-grained stages from VLM annotations
|
||||
- **Best for**: When you want detailed subtask tracking but don't need to define high-level stages
|
||||
|
||||
Workflow:
|
||||
|
||||
```
|
||||
1. Annotate (dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dual">
|
||||
|
||||
**Both sparse and dense annotations from VLM.** Full dual-head mode as described in the SARM paper, with both high-level (sparse) and fine-grained (dense) stage predictions.
|
||||
|
||||
- **Sparse head**: High-level stages from VLM annotations
|
||||
- **Dense head**: Fine-grained stages from VLM annotations
|
||||
- **Best for**: Complex multi-stage tasks where both granularities are useful
|
||||
|
||||
Workflow:
|
||||
|
||||
```
|
||||
1. Annotate (sparse+dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
---
|
||||
|
||||
## Step 1: Subtask Annotation
|
||||
|
||||
<hfoptions id="annotation_mode">
|
||||
<hfoption id="single_stage">
|
||||
|
||||
**No annotation required!** Skip this step entirely. The model will use the episode's task description and compute linear progress automatically.
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dense_only">
|
||||
|
||||
Generate **dense (fine-grained) annotations only** using a VLM. The sparse stage will be auto-generated.
|
||||
|
||||
```bash
|
||||
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
|
||||
--repo-id your-username/your-dataset \
|
||||
--dense-only \
|
||||
--dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
|
||||
--video-key observation.images.base \
|
||||
--num-workers 4 \
|
||||
--push-to-hub
|
||||
```
|
||||
|
||||
**What gets saved:**
|
||||
|
||||
- `meta/temporal_proportions_sparse.json` - Auto-generated sparse proportions (`{"task": 1.0}`)
|
||||
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
|
||||
- Per-episode columns in `episodes/*.parquet`:
|
||||
- `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
|
||||
- (also time-based columns: `dense_subtask_start_times`, `dense_subtask_end_times`)
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dual">
|
||||
|
||||
Generate **both sparse (high-level) and dense (fine-grained) annotations** using a VLM.
|
||||
|
||||
```bash
|
||||
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
|
||||
--repo-id your-username/your-dataset \
|
||||
--sparse-subtasks "Bring arms up from starting position,Fold the towel (3 folds in total)" \
|
||||
--dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
|
||||
--video-key observation.images.base \
|
||||
--num-workers 4 \
|
||||
--push-to-hub
|
||||
```
|
||||
|
||||
**What gets saved:**
|
||||
|
||||
- `meta/temporal_proportions_sparse.json` - Sparse temporal proportions
|
||||
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
|
||||
- Per-episode columns in `episodes/*.parquet`:
|
||||
- `sparse_subtask_names`, `sparse_subtask_start_frames`, `sparse_subtask_end_frames`
|
||||
- `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
|
||||
- (also time-based columns: `*_subtask_start_times`, `*_subtask_end_times`)
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Annotation Arguments
|
||||
|
||||
| Argument | Description |
|
||||
| ---------------------- | ------------------------------------------------------------------------------- |
|
||||
| `--repo-id` | HuggingFace dataset repository ID |
|
||||
| `--sparse-subtasks` | Comma-separated list of high-level subtask names |
|
||||
| `--dense-subtasks` | Comma-separated list of fine-grained subtask names |
|
||||
| `--dense-only` | Generate only dense annotations (auto-creates sparse "task" stage) |
|
||||
| `--video-key` | Camera/video key to use (e.g., `observation.images.top`) |
|
||||
| `--num-workers` | Number of parallel GPU workers (default: 1) |
|
||||
| `--episodes` | Specific episode indices to annotate (default: all) |
|
||||
| `--skip-existing` | Skip episodes that already have annotations |
|
||||
| `--model` | VLM model (default: `Qwen/Qwen3-VL-30B-A3B-Instruct`) |
|
||||
| `--num-visualizations` | Number of episodes to visualize after annotation (default: 5, set to 0 to skip) |
|
||||
|
||||
> **Note**: After annotation completes, 5 episodes are automatically visualized by default. Use `--num-visualizations 0` to skip this step.
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Verify Annotations
|
||||
|
||||
<hfoptions id="verify_mode">
|
||||
<hfoption id="single_stage">
|
||||
|
||||
**No verification needed!** Skip this step.
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dense_only">
|
||||
|
||||
Visualize annotations using the `--visualize-only` flag:
|
||||
|
||||
```bash
|
||||
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
|
||||
--repo-id your-username/your-dataset \
|
||||
--visualize-only \
|
||||
--visualize-type dense \
|
||||
--num-visualizations 5 \
|
||||
--video-key observation.images.base \
|
||||
--output-dir ./subtask_viz
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dual">
|
||||
|
||||
Visualize annotations using the `--visualize-only` flag:
|
||||
|
||||
```bash
|
||||
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
|
||||
--repo-id your-username/your-dataset \
|
||||
--visualize-only \
|
||||
--visualize-type both \
|
||||
--num-visualizations 5 \
|
||||
--video-key observation.images.base \
|
||||
--output-dir ./subtask_viz
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
This generates visualizations showing video frames with subtask boundaries overlaid and a timeline of subtasks.
|
||||
|
||||
### Visualization Arguments
|
||||
|
||||
| Argument | Description |
|
||||
| ---------------------- | -------------------------------------------------------------- |
|
||||
| `--visualize-only` | Only visualize existing annotations (no generation) |
|
||||
| `--num-visualizations` | Number of episodes to visualize (default: 5) |
|
||||
| `--visualize-type` | Type of annotations to visualize: `sparse`, `dense`, or `both` |
|
||||
|
||||
**Tip**: If annotations are inaccurate, adjust your subtask descriptions to be more specific and re-run.
|
||||
|
||||
---
|
||||
|
||||
## Step 3: Train SARM
|
||||
|
||||
<hfoptions id="train_mode">
|
||||
<hfoption id="single_stage">
|
||||
|
||||
Train with **no annotations** - uses linear progress from 0 to 1:
|
||||
|
||||
```bash
|
||||
python src/lerobot/scripts/lerobot_train.py \
|
||||
--dataset.repo_id=your-username/your-dataset \
|
||||
--policy.type=sarm \
|
||||
--policy.annotation_mode=single_stage \
|
||||
--policy.image_key=observation.images.base \
|
||||
--output_dir=outputs/train/sarm_single \
|
||||
--batch_size=32 \
|
||||
--steps=5000 \
|
||||
--wandb.enable=true \
|
||||
--wandb.project=sarm \
|
||||
--policy.repo_id=your-username/your-model-name
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dense_only">
|
||||
|
||||
Train with **dense annotations only** (sparse auto-generated):
|
||||
|
||||
```bash
|
||||
python src/lerobot/scripts/lerobot_train.py \
|
||||
--dataset.repo_id=your-username/your-dataset \
|
||||
--policy.type=sarm \
|
||||
--policy.annotation_mode=dense_only \
|
||||
--policy.image_key=observation.images.base \
|
||||
--output_dir=outputs/train/sarm_dense \
|
||||
--batch_size=32 \
|
||||
--steps=5000 \
|
||||
--wandb.enable=true \
|
||||
--wandb.project=sarm \
|
||||
--policy.repo_id=your-username/your-model-name
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dual">
|
||||
|
||||
Train with **both sparse and dense annotations**:
|
||||
|
||||
```bash
|
||||
python src/lerobot/scripts/lerobot_train.py \
|
||||
--dataset.repo_id=your-username/your-dataset \
|
||||
--policy.type=sarm \
|
||||
--policy.annotation_mode=dual \
|
||||
--policy.image_key=observation.images.base \
|
||||
--output_dir=outputs/train/sarm_dual \
|
||||
--batch_size=32 \
|
||||
--steps=5000 \
|
||||
--wandb.enable=true \
|
||||
--wandb.project=sarm \
|
||||
--policy.repo_id=your-username/your-model-name
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Multi-GPU Training
|
||||
|
||||
To train on multiple GPUs, launch the training script with `accelerate launch --multi_gpu --num_processes=4` in place of `python`.
|
||||
|
||||
### Training Arguments
|
||||
|
||||
| Argument | Description | Default |
|
||||
| -------------------------- | ----------------------------------------------------------------- | ------------------------ |
|
||||
| `--policy.annotation_mode` | `single_stage`, `dense_only`, or `dual` | `single_stage` |
|
||||
| `--policy.image_key` | Camera key for images | `observation.images.top` |
|
||||
| `--policy.state_key` | Key for joint states | `observation.state` |
|
||||
| `--policy.n_obs_steps` | Observation history steps (total obs frames = `n_obs_steps + 1`) | `8` |
|
||||
| `--policy.frame_gap` | Gap (in frames) between sampled observations (at 30 fps: 30 ≈ 1s) | `30` |
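
To see how `n_obs_steps` and `frame_gap` combine, the sketch below lists the frame indices that would make up one observation window. It assumes the history looks backward from the current frame and clamps at the start of the episode; both are assumptions for illustration, and `observation_indices` is a hypothetical helper.

```python
def observation_indices(t, n_obs_steps=8, frame_gap=30):
    """Indices of the n_obs_steps + 1 frames sampled for the frame at time t."""
    return [max(0, t - i * frame_gap) for i in range(n_obs_steps, -1, -1)]


# Late in an episode: nine frames spaced 30 frames (about 1 s at 30 fps) apart.
assert observation_indices(240) == [0, 30, 60, 90, 120, 150, 180, 210, 240]

# Early in an episode: indices before frame 0 are clamped to the first frame.
assert observation_indices(45) == [0, 0, 0, 0, 0, 0, 0, 15, 45]
```

With the defaults, one window spans roughly eight seconds of video at 30 fps.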
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Visualize Predictions
|
||||
|
||||
Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predictions (and, if available, annotation-derived targets) without writing a parquet file.
|
||||
|
||||
<hfoptions id="viz_mode">
|
||||
<hfoption id="single_stage">
|
||||
|
||||
```bash
|
||||
python src/lerobot/policies/sarm/compute_rabc_weights.py \
|
||||
--dataset-repo-id your-username/your-dataset \
|
||||
--reward-model-path your-username/sarm-model \
|
||||
--visualize-only \
|
||||
--num-visualizations 5 \
|
||||
--head-mode sparse \
|
||||
--output-dir ./sarm_viz
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dense_only">
|
||||
|
||||
```bash
|
||||
python src/lerobot/policies/sarm/compute_rabc_weights.py \
|
||||
--dataset-repo-id your-username/your-dataset \
|
||||
--reward-model-path your-username/sarm-model \
|
||||
--visualize-only \
|
||||
--num-visualizations 5 \
|
||||
--head-mode dense \
|
||||
--output-dir ./sarm_viz
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="dual">
|
||||
|
||||
```bash
|
||||
python src/lerobot/policies/sarm/compute_rabc_weights.py \
|
||||
--dataset-repo-id your-username/your-dataset \
|
||||
--reward-model-path your-username/sarm-model \
|
||||
--visualize-only \
|
||||
--num-visualizations 5 \
|
||||
--head-mode both \
|
||||
--output-dir ./sarm_viz
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
The visualization shows:
|
||||
|
||||
- **Progress plot**: Predicted progress (and optional annotation-derived “GT” when available and `--stride 1`)
|
||||
- **Stage probabilities**: Stacked area plot of predicted stage probabilities
|
||||
- **Sample frames**: Key frames from the episode with progress/stage labels
|
||||
|
||||
### Visualization Arguments
|
||||
|
||||
| Argument | Description |
|
||||
| ---------------------- | --------------------------------------------------------- |
|
||||
| `--visualize-only` | Only visualize predictions (no RABC computation) |
|
||||
| `--num-visualizations` | Number of episodes to visualize (default: 5) |
|
||||
| `--head-mode` | SARM head to use: `sparse`, `dense`, or `both` |
|
||||
| `--stride` | Compute every N frames, interpolate the rest (default: 1) |
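
The `--stride` option trades accuracy for speed by running the model on every N-th frame only. A minimal sketch of filling in the skipped frames, assuming linear interpolation and holding the last computed value at the tail (the script's actual interpolation scheme may differ; `interpolate_progress` is a hypothetical helper):

```python
def interpolate_progress(sampled, stride, num_frames):
    """Expand predictions made at frames 0, stride, 2*stride, ... to all frames."""
    last = len(sampled) - 1
    values = []
    for t in range(num_frames):
        left = min(t // stride, last)
        right = min(left + 1, last)
        # Fraction of the way from the left sample to the right sample.
        frac = (t - left * stride) / stride if right != left else 0.0
        values.append(sampled[left] + frac * (sampled[right] - sampled[left]))
    return values


# Predictions at frames 0, 2 and 4 are expanded to six frames.
dense = interpolate_progress([0.0, 0.4, 0.8], stride=2, num_frames=6)
expected = [0.0, 0.2, 0.4, 0.6, 0.8, 0.8]
assert all(abs(a - b) < 1e-9 for a, b in zip(dense, expected))
```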
|
||||
|
||||
---
|
||||
|
||||
## Step 5 (Optional): Train Policy with RA-BC
|
||||
|
||||
Reward-Aligned Behavior Cloning (RA-BC) uses the trained SARM model to weight training samples based on predicted progress improvement. This requires two steps:
|
||||
|
||||
1. **Precompute progress values** for all frames using the trained SARM model
|
||||
2. **Train policy** with RA-BC weighting using the precomputed values
|
||||
|
||||
### How RA-BC Works
|
||||
|
||||
For each training sample, RA-BC computes the progress delta:
|
||||
|
||||
```
|
||||
r_i = φ(o_{t+Δ}) - φ(o_t)
|
||||
```
|
||||
|
||||
Where `φ` is the SARM progress prediction and `Δ` is the policy's `chunk_size`. Samples with positive progress (good demonstrations) get higher weights, while samples with negative or zero progress get down-weighted.
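
A small sketch of the progress delta for one episode, using a plain list of per-frame progress values. `progress_deltas` is a hypothetical helper, and clamping `t + Δ` to the last frame of the episode is an assumption about how the end of an episode is handled.

```python
def progress_deltas(progress, chunk_size):
    """r_t = phi(o_{t+chunk_size}) - phi(o_t), clamped at the episode's last frame."""
    last = len(progress) - 1
    return [
        progress[min(t + chunk_size, last)] - progress[t]
        for t in range(len(progress))
    ]


# Progress stalls at frame 3 and regresses at frame 4 before recovering.
progress = [0.0, 0.1, 0.3, 0.3, 0.2, 0.6]
deltas = progress_deltas(progress, chunk_size=2)

expected = [0.3, 0.2, -0.1, 0.3, 0.4, 0.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(deltas, expected))
```

Frame 2 gets a negative delta because the robot is further from completion two frames later, which is the kind of sample RA-BC down-weights.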
|
||||
|
||||
The weighting follows **Equations 8-9** from the paper:
|
||||
|
||||
- **Soft weight**: `w̃_i = clip((r_i − (μ − 2σ)) / (4σ + ε), 0, 1)`
|
||||
- **Final weight**: `w_i = 𝟙{r_i > κ} + 𝟙{0 ≤ r_i ≤ κ} × w̃_i`
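
The two equations translate directly into code. In this sketch `μ` and `σ` are taken as the mean and population standard deviation of the deltas passed in; which set of samples they are computed over in the real training loop is not specified here, so treat that choice, the `eps` value, and the `rabc_weights` name as assumptions.

```python
import statistics


def rabc_weights(deltas, kappa=0.01, eps=1e-6):
    """Per-sample RA-BC weights from progress deltas r_i."""
    mu = statistics.fmean(deltas)
    sigma = statistics.pstdev(deltas)
    weights = []
    for r in deltas:
        # Soft weight: clip((r - (mu - 2*sigma)) / (4*sigma + eps), 0, 1)
        soft = min(max((r - (mu - 2 * sigma)) / (4 * sigma + eps), 0.0), 1.0)
        if r > kappa:
            weights.append(1.0)   # clear progress: full weight
        elif 0 <= r <= kappa:
            weights.append(soft)  # marginal progress: soft weight
        else:
            weights.append(0.0)   # regression: excluded from the loss
    return weights


weights = rabc_weights([-0.05, 0.0, 0.005, 0.2], kappa=0.01)
assert weights[0] == 0.0
assert weights[3] == 1.0
assert 0.0 < weights[1] < weights[2] < 1.0
```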
|
||||
|
||||
### Step 5a: Compute SARM Progress Values
|
||||
|
||||
First, run the SARM model on all frames in your dataset to compute progress values:
|
||||
|
||||
```bash
|
||||
python src/lerobot/policies/sarm/compute_rabc_weights.py \
|
||||
--dataset-repo-id your-username/your-dataset \
|
||||
--reward-model-path your-username/sarm-model \
|
||||
--head-mode sparse \
|
||||
--num-visualizations 5 \
|
||||
--push-to-hub
|
||||
```
|
||||
|
||||
This script:
|
||||
|
||||
- Processes all frames and computes progress values
|
||||
- Saves progress values to a parquet file next to the dataset on disk (defaults to `<dataset_root>/sarm_progress.parquet`)
|
||||
- Generates visualizations of the first N episodes (default: 5)
|
||||
|
||||
**Arguments:**
|
||||
|
||||
| Argument               | Description                                                    | Default    |
| ---------------------- | -------------------------------------------------------------- | ---------- |
| `--reward-model-path`  | Path to trained SARM model                                     | (required) |
| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`                 | `sparse`   |
| `--device`             | Device for inference                                           | `cuda`     |
| `--visualize-only`     | Only visualize predictions (no RA-BC computation)              | `false`    |
| `--num-visualizations` | Number of episodes to visualize (default: 5, set to 0 to skip) | `5`        |
|
||||
|
||||
**Output format** (`sarm_progress.parquet`):
|
||||
|
||||
| Column            | Description                                    |
| ----------------- | ---------------------------------------------- |
| `index`           | Global frame index in dataset                  |
| `episode_index`   | Episode number                                 |
| `frame_index`     | Local frame index within episode               |
| `progress_sparse` | Sparse head progress value [0, 1]              |
| `progress_dense`  | Dense head progress value [0, 1] (if computed) |
|
||||
|
||||
### Step 5b: Train Policy with RA-BC
|
||||
|
||||
Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
|
||||
|
||||
```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_head_mode=sparse \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```
|
||||
|
||||
The training script automatically:
|
||||
|
||||
- Loads the precomputed progress values from the parquet file
|
||||
- Uses the policy's `chunk_size` to compute progress deltas (Δ)
|
||||
- Computes sample weights based on progress improvement
|
||||
- Applies weighted loss during training
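The delta step in the list above can be sketched for a single episode as follows. This is an illustration under stated assumptions, not the training script's code: the function name `progress_deltas` is hypothetical, and clamping the look-ahead index to the last frame of the episode is an assumed way of handling frames closer than `chunk_size` to the end.

```python
def progress_deltas(progress, chunk_size):
    """Hypothetical helper: r_t = phi(o_{t+chunk_size}) - phi(o_t) for one episode.

    Assumption: look-ahead indices past the end of the episode are clamped
    to the final frame.
    """
    last = len(progress) - 1
    return [
        progress[min(t + chunk_size, last)] - progress[t]
        for t in range(len(progress))
    ]


if __name__ == "__main__":
    # Progress values for a 4-frame episode, e.g. the `progress_sparse` column
    print(progress_deltas([0.0, 0.1, 0.2, 0.3], chunk_size=2))
```

The resulting deltas are what the weighting rule compares against `kappa`.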
|
||||
|
||||
**RA-BC Arguments:**
|
||||
|
||||
| Argument               | Description                                                | Default                            |
| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
| `--use_rabc`           | Enable RA-BC sample weighting                              | `false`                            |
| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
| `--rabc_head_mode`     | Which SARM head's progress to use: `sparse` or `dense`     | `sparse`                           |
| `--rabc_kappa`         | Threshold κ for high-quality samples                       | `0.01`                             |
|
||||
|
||||
### Tuning RA-BC Kappa
|
||||
|
||||
The `kappa` parameter is the threshold that determines which samples get full weight (w=1). Understanding how to tune it is critical for RA-BC to work effectively.
|
||||
|
||||
**How the weighting works:**
|
||||
|
||||
| Condition           | Weight                  |
| ------------------- | ----------------------- |
| `delta > kappa`     | 1.0 (hard threshold)    |
| `0 ≤ delta ≤ kappa` | Soft weight from Eq. 8  |
| `delta < 0`         | 0.0 (negative progress) |
|
||||
|
||||
**Diagnosing kappa issues:**
|
||||
|
||||
Monitor these WandB metrics during training:
|
||||
|
||||
| Metric             | Healthy Range | Problem Indicator         |
| ------------------ | ------------- | ------------------------- |
| `rabc_mean_weight` | 0.3 - 0.8     | ≈ 1.0 means kappa too low |
| `rabc_delta_mean`  | > 0           | Should be positive        |
| `rabc_delta_std`   | > 0           | Variance in data quality  |
|
||||
|
||||
**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
|
||||
|
||||
**Setting kappa based on your data:**
|
||||
|
||||
The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:
|
||||
|
||||
```
# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
# Most deltas fall in range [0.01, 0.05]

# Option 1: Set kappa = delta_mean (medium selectivity)
--rabc_kappa=0.03

# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
--rabc_kappa=0.05

# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
--rabc_kappa=0.07
```
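The three options above are simple arithmetic on the logged statistics. The following sketch reproduces them and adds a check of how many samples would receive full weight for a given threshold. Both function names are hypothetical and not part of LeRobot.

```python
def kappa_candidates(delta_mean, delta_std):
    """Hypothetical helper: the three kappa options described above."""
    return {
        "medium": delta_mean,
        "high": delta_mean + delta_std,
        "very_high": delta_mean + 2 * delta_std,
    }


def fraction_full_weight(deltas, kappa):
    """Share of samples with delta > kappa, i.e. samples that get weight 1."""
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d > kappa) / len(deltas)


if __name__ == "__main__":
    for name, kappa in kappa_candidates(0.03, 0.02).items():
        print(f"{name}: --rabc_kappa={kappa:.2f}")
```

If `fraction_full_weight` is close to 1.0 for your deltas, the threshold is too low to separate samples and a larger candidate is worth trying.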
|
||||
|
||||
**When RA-BC may not help:**
|
||||
|
||||
If your dataset is already high quality (consistent progress across all demonstrations), RA-BC won't provide much benefit since there's nothing to filter.
|
||||
|
||||
### Multi-GPU Training with RA-BC
|
||||
|
||||
```bash
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```
|
||||
|
||||
---
|
||||
|
||||
## Tips & Best Practices
|
||||
|
||||
### Choosing a Mode
|
||||
|
||||
- **Start with `single_stage`** for quick experiments - no annotation overhead
|
||||
- Use **`dense_only`** when you want detailed progress tracking but tasks don't have clear high-level stages
|
||||
- Use **`dual`** for complex tasks where both coarse and fine-grained progress is meaningful
|
||||
|
||||
### Annotation Quality
|
||||
|
||||
1. **Be specific with subtask names**: Instead of "fold", use "grab near side and fold toward center"
|
||||
2. **Verify with visualization**: Always check a few episodes before training
|
||||
3. **Consistent naming**: Use the same subtask names across all episodes
|
||||
|
||||
### RA-BC
|
||||
|
||||
1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
|
||||
2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
|
||||
|
||||
---
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
@article{chen2025sarm,
  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
  journal={arXiv preprint arXiv:2509.25358},
  year={2025}
}
```
|
||||
@@ -103,7 +103,7 @@ lerobot-setup-motors \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower import SO100Follower, SO100FollowerConfig
|
||||
|
||||
config = SO100FollowerConfig(
|
||||
port="/dev/tty.usbmodem585A0076841",
|
||||
@@ -177,7 +177,7 @@ lerobot-setup-motors \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader import SO100Leader, SO100LeaderConfig
|
||||
|
||||
config = SO100LeaderConfig(
|
||||
port="/dev/tty.usbmodem585A0076841",
|
||||
@@ -579,7 +579,7 @@ lerobot-calibrate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.robots.so_follower import SO100FollowerConfig, SO100Follower
|
||||
from lerobot.robots.so100_follower import SO100FollowerConfig, SO100Follower
|
||||
|
||||
config = SO100FollowerConfig(
|
||||
port="/dev/tty.usbmodem585A0076891",
|
||||
@@ -617,7 +617,7 @@ lerobot-calibrate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO100LeaderConfig, SO100Leader
|
||||
from lerobot.teleoperators.so100_leader import SO100LeaderConfig, SO100Leader
|
||||
|
||||
config = SO100LeaderConfig(
|
||||
port="/dev/tty.usbmodem58760431551",
|
||||
|
||||
@@ -125,7 +125,7 @@ lerobot-setup-motors \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
|
||||
from lerobot.robots.so101_follower import SO101Follower, SO101FollowerConfig
|
||||
|
||||
config = SO101FollowerConfig(
|
||||
port="/dev/tty.usbmodem585A0076841",
|
||||
@@ -201,7 +201,7 @@ lerobot-setup-motors \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
|
||||
from lerobot.teleoperators.so101_leader import SO101Leader, SO101LeaderConfig
|
||||
|
||||
config = SO101LeaderConfig(
|
||||
port="/dev/tty.usbmodem585A0076841",
|
||||
@@ -364,7 +364,7 @@ lerobot-calibrate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.robots.so_follower import SO101FollowerConfig, SO101Follower
|
||||
from lerobot.robots.so101_follower import SO101FollowerConfig, SO101Follower
|
||||
|
||||
config = SO101FollowerConfig(
|
||||
port="/dev/tty.usbmodem585A0076891",
|
||||
@@ -413,7 +413,7 @@ lerobot-calibrate \
|
||||
|
||||
<!-- prettier-ignore-start -->
|
||||
```python
|
||||
from lerobot.teleoperators.so_leader import SO101LeaderConfig, SO101Leader
|
||||
from lerobot.teleoperators.so101_leader import SO101LeaderConfig, SO101Leader
|
||||
|
||||
config = SO101LeaderConfig(
|
||||
port="/dev/tty.usbmodem58760431551",
|
||||
|
||||
@@ -1,21 +1,20 @@
|
||||
# Unitree G1
|
||||
# Unitree G1 Robot Setup and Control
|
||||
|
||||
This guide covers the complete setup process for the Unitree G1 humanoid, from initial connection to running gr00t_wbc locomotion.
|
||||
|
||||
## About
|
||||
## About the Unitree G1
|
||||
|
||||
We support both the 29-DOF and 23-DOF G1 EDU versions. We introduce:
|
||||
We offer support for both 29 and 23 DOF G1. In this first PR we introduce:
|
||||
|
||||
- **`unitree g1` robot class, handling low level read/write from/to the humanoid**
|
||||
- **ZMQ socket bridge** for remote communication and camera streaming, allowing for remote policy deployment over wlan, eth or directly on the robot
|
||||
- **Locomotion policies** from NVIDIA gr00t and Amazon FAR Holosoma
|
||||
- **Simulation mode** for testing policies without the physical robot in mujoco
|
||||
- **`unitree g1` robot class, handling low level communication with the humanoid**
|
||||
- **ZMQ socket bridge** for remote communication over WiFi, allowing one to deploy policies remotely instead of over ethernet or directly on the Orin
|
||||
- **GR00T locomotion policy** for bipedal walking and balance
|
||||
|
||||
---
|
||||
|
||||
## Connection guide
|
||||
## Part 1: Connect to Robot over Ethernet
|
||||
|
||||
### Step 1: Configure Ethernet Interface
|
||||
### Step 1: Configure Your Computer's Ethernet Interface
|
||||
|
||||
Set a static IP on the same subnet as the robot:
|
||||
|
||||
@@ -26,7 +25,7 @@ sudo ip addr add 192.168.123.200/24 dev enp131s0
|
||||
sudo ip link set enp131s0 up
|
||||
```
|
||||
|
||||
**Note**: The G1's Ethernet IP is fixed at `192.168.123.164`. Your computer must use `192.168.123.x` with x ≠ 164.
|
||||
**Note**: The robot's Ethernet IP is fixed at `192.168.123.164`. Your computer must use `192.168.123.x` where x ≠ 164.
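The subnet rule in the note can be checked before assigning the address. This is a small sketch using Python's standard `ipaddress` module. The function name `is_valid_host_ip` is hypothetical, and excluding the network and broadcast addresses is an extra assumption beyond what the note states.

```python
import ipaddress

ROBOT_IP = ipaddress.ip_address("192.168.123.164")
ROBOT_NET = ipaddress.ip_network("192.168.123.0/24")


def is_valid_host_ip(candidate: str) -> bool:
    """Hypothetical helper: True if `candidate` can be used for the computer."""
    ip = ipaddress.ip_address(candidate)
    reserved = (ROBOT_NET.network_address, ROBOT_NET.broadcast_address)
    # Must be on the robot's /24, must not collide with the robot itself,
    # and (assumption) must not be the network or broadcast address.
    return ip in ROBOT_NET and ip != ROBOT_IP and ip not in reserved


if __name__ == "__main__":
    for addr in ("192.168.123.200", "192.168.123.164", "192.168.1.10"):
        print(addr, is_valid_host_ip(addr))
```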
|
||||
|
||||
### Step 2: SSH into the Robot
|
||||
|
||||
@@ -35,24 +34,25 @@ ssh unitree@192.168.123.164
|
||||
# Password: 123
|
||||
```
|
||||
|
||||
You should now be connected to the G1's Orin.
|
||||
You should now be connected to the robot's onboard computer.
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Enable WiFi on the Robot
|
||||
|
||||
Wlan0 is disabled by default on the G1. To enable it:
|
||||
Once connected via Ethernet, follow these steps to enable WiFi:
|
||||
|
||||
### Step 1: Enable WiFi Hardware
|
||||
|
||||
```bash
|
||||
# Unblock WiFi radio
|
||||
sudo rfkill unblock wifi
|
||||
sudo rfkill unblock all
|
||||
|
||||
# Bring up wlan0
|
||||
# Bring up WiFi interface
|
||||
sudo ip link set wlan0 up
|
||||
|
||||
# Enable NetworkManager control of wlan0
|
||||
# Enable NetworkManager control
|
||||
sudo nmcli radio wifi on
|
||||
sudo nmcli device set wlan0 managed yes
|
||||
sudo systemctl restart NetworkManager
|
||||
@@ -72,7 +72,7 @@ sudo iptables -A FORWARD -i wlp132s0f0 -o enp131s0 -m state --state RELATED,ESTA
|
||||
sudo iptables -A FORWARD -i enp131s0 -o wlp132s0f0 -j ACCEPT
|
||||
```
|
||||
|
||||
**On the G1:**
|
||||
**On the robot:**
|
||||
|
||||
```bash
|
||||
# Add laptop as default gateway
|
||||
@@ -110,7 +110,7 @@ ssh unitree@<YOUR_ROBOT_IP>
|
||||
# Password: 123
|
||||
```
|
||||
|
||||
Replace `<YOUR_ROBOT_IP>` with your robot's actual WiFi IP address.
|
||||
Replace `<YOUR_ROBOT_IP>` with your robot's actual WiFi IP address (e.g., `172.18.129.215`).
|
||||
|
||||
---
|
||||
|
||||
@@ -146,9 +146,9 @@ python src/lerobot/robots/unitree_g1/run_g1_server.py
|
||||
|
||||
---
|
||||
|
||||
## Part 4: Controlling the robot
|
||||
## Part 4: Running GR00T Locomotion
|
||||
|
||||
With the robot server running, you can now control the robot remotely. Let's launch a locomotion policy.
|
||||
With the robot server running, you can now control the robot from your laptop.
|
||||
|
||||
### Step 1: Install LeRobot on your machine
|
||||
|
||||
@@ -171,30 +171,30 @@ Edit the config file to match your robot's WiFi IP:
|
||||
robot_ip: str = "<YOUR_ROBOT_IP>" # Replace with your robot's WiFi IP.
|
||||
```
|
||||
|
||||
**Note**: When running directly on the G1 (not remotely), set `robot_ip: str = "127.0.0.1"` instead.
|
||||
|
||||
### Step 3: Run the Locomotion Policy
|
||||
|
||||
```bash
|
||||
# Run GR00T locomotion controller
|
||||
python examples/unitree_g1/gr00t_locomotion.py --repo-id "nepyope/GR00T-WholeBodyControl_g1"
|
||||
|
||||
# Run Holosoma locomotion controller
|
||||
python examples/unitree_g1/holosoma_locomotion.py
|
||||
|
||||
```
|
||||
|
||||
### Step 4: Control with Remote
|
||||
|
||||
- **Left stick**: Forward/backward and left/right movement
|
||||
- **Right stick**: Rotation
|
||||
- **R1 button**: Raise waist height
|
||||
- **R2 button**: Lower waist height
|
||||
|
||||
Press `Ctrl+C` to stop the policy.
|
||||
|
||||
---
|
||||
|
||||
## Running in Simulation Mode (MuJoCo)
|
||||
|
||||
You can test policies in MuJoCo before running them on the physical robot. To do so, set `is_simulation=True` in the config.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [Unitree SDK Documentation](https://github.com/unitreerobotics/unitree_sdk2_python)
|
||||
- [GR00T-WholeBodyControl](https://github.com/NVlabs/GR00T-WholeBodyControl)
|
||||
- [Holosoma](https://github.com/amazon-far/holosoma)
|
||||
- [GR00T Policy Repository](https://huggingface.co/nepyope/GR00T-WholeBodyControl_g1)
|
||||
- [LeRobot Documentation](https://github.com/huggingface/lerobot)
|
||||
- [Unitree_IL_Lerobot](https://github.com/unitreerobotics/unitree_IL_lerobot)
|
||||
|
||||
|
||||
@@ -11,14 +11,13 @@ LeRobot provides several utilities for manipulating datasets:
|
||||
3. **Merge Datasets** - Combine multiple datasets into one. The datasets must have identical features, and episodes are concatenated in the order specified in `repo_ids`
|
||||
4. **Add Features** - Add new features to a dataset
|
||||
5. **Remove Features** - Remove features from a dataset
|
||||
6. **Convert to Video** - Convert image-based datasets to video format for efficient storage
|
||||
|
||||
The core implementation is in `lerobot.datasets.dataset_tools`.
|
||||
An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.
|
||||
|
||||
## Command-Line Tool: lerobot-edit-dataset
|
||||
|
||||
`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, add features, remove features, and convert image datasets to video format.
|
||||
`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, add features, and remove features.
|
||||
|
||||
Run `lerobot-edit-dataset --help` for more information on the configuration of each operation.
|
||||
|
||||
@@ -87,71 +86,9 @@ lerobot-edit-dataset \
|
||||
--operation.feature_names "['observation.images.top']"
|
||||
```
|
||||
|
||||
#### Convert to Video
|
||||
|
||||
Convert an image-based dataset to video format, creating a new LeRobotDataset where images are stored as videos. This is useful for reducing storage requirements and improving data loading performance. The new dataset will have the exact same structure as the original, but with images encoded as MP4 videos in the proper LeRobot format.
|
||||
|
||||
```bash
|
||||
# Local-only: Save to a custom output directory (no hub push)
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--operation.type convert_to_video \
|
||||
--operation.output_dir /path/to/output/pusht_video
|
||||
|
||||
# Save with new repo_id (local storage)
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--new_repo_id lerobot/pusht_video \
|
||||
--operation.type convert_to_video
|
||||
|
||||
# Convert and push to Hugging Face Hub
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--new_repo_id lerobot/pusht_video \
|
||||
--operation.type convert_to_video \
|
||||
--push_to_hub true
|
||||
|
||||
# Convert with custom video codec and quality settings
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--operation.type convert_to_video \
|
||||
--operation.output_dir outputs/pusht_video \
|
||||
--operation.vcodec libsvtav1 \
|
||||
--operation.pix_fmt yuv420p \
|
||||
--operation.g 2 \
|
||||
--operation.crf 30
|
||||
|
||||
# Convert only specific episodes
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--operation.type convert_to_video \
|
||||
--operation.output_dir outputs/pusht_video \
|
||||
--operation.episode_indices "[0, 1, 2, 5, 10]"
|
||||
|
||||
# Convert with multiple workers for parallel processing
|
||||
lerobot-edit-dataset \
|
||||
--repo_id lerobot/pusht_image \
|
||||
--operation.type convert_to_video \
|
||||
--operation.output_dir outputs/pusht_video \
|
||||
--operation.num_workers 8
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
|
||||
- `output_dir`: Custom output directory (optional - by default uses `new_repo_id` or `{repo_id}_video`)
|
||||
- `vcodec`: Video codec to use - options: `h264`, `hevc`, `libsvtav1` (default: `libsvtav1`)
|
||||
- `pix_fmt`: Pixel format - options: `yuv420p`, `yuv444p` (default: `yuv420p`)
|
||||
- `g`: Group of pictures (GOP) size - lower values give better quality but larger files (default: 2)
|
||||
- `crf`: Constant rate factor - lower values give better quality but larger files, 0 is lossless (default: 30)
|
||||
- `fast_decode`: Fast decode tuning option (default: 0)
|
||||
- `episode_indices`: List of specific episodes to convert (default: all episodes)
|
||||
- `num_workers`: Number of parallel workers for processing (default: 4)
|
||||
|
||||
**Note:** The resulting dataset will be a proper LeRobotDataset with all cameras encoded as videos in the `videos/` directory, with parquet files containing only metadata (no raw image data). All episodes, stats, and tasks are preserved.
|
||||
|
||||
### Push to Hub
|
||||
|
||||
Add the `--push_to_hub true` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:
|
||||
Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:
|
||||
|
||||
```bash
|
||||
lerobot-edit-dataset \
|
||||
@@ -159,45 +96,7 @@ lerobot-edit-dataset \
|
||||
--new_repo_id lerobot/pusht_after_deletion \
|
||||
--operation.type delete_episodes \
|
||||
--operation.episode_indices "[0, 2, 5]" \
|
||||
--push_to_hub true
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
There is also a tool for adding features to a dataset that is not yet covered in `lerobot-edit-dataset`.
|
||||
|
||||
# Dataset Visualization
|
||||
|
||||
## Online Visualization
|
||||
|
||||
When you record a dataset using `lerobot`, it automatically uploads to the Hugging Face Hub unless you specify otherwise. To view the dataset online, use our **LeRobot Dataset Visualizer**, available at:
|
||||
https://huggingface.co/spaces/lerobot/visualize_dataset
|
||||
|
||||
## Local Visualization
|
||||
|
||||
You can also visualize episodes from a dataset locally using our command-line tool.
|
||||
|
||||
**From the Hugging Face Hub:**
|
||||
|
||||
```bash
|
||||
lerobot-dataset-viz \
|
||||
--repo-id lerobot/pusht \
|
||||
--episode-index 0
|
||||
```
|
||||
|
||||
**From a local folder:**
|
||||
Add the `--root` option and set `--mode local`. For example, to search in `./my_local_data_dir/lerobot/pusht`:
|
||||
|
||||
```bash
|
||||
lerobot-dataset-viz \
|
||||
--repo-id lerobot/pusht \
|
||||
--root ./my_local_data_dir \
|
||||
--mode local \
|
||||
--episode-index 0
|
||||
```
|
||||
|
||||
Once executed, the tool opens `rerun.io` and displays the camera streams, robot states, and actions for the selected episode.
|
||||
|
||||
For advanced usage—including visualizing datasets stored on a remote server—run:
|
||||
|
||||
```bash
|
||||
lerobot-dataset-viz --help
|
||||
```
|
||||
|
||||
@@ -1,80 +0,0 @@
|
||||
# WALL-OSS
|
||||
|
||||
WALL-OSS is an open-source foundation model for embodied intelligence, proposed by the [XSquare Robot](https://x2robot.com/en/research/68bc2cde8497d7f238dde690) team in 2025. The LeRobot implementation is adapted from their open-source [WallX](https://github.com/X-Square-Robot/wall-x) repository.
|
||||
|
||||
X Square Robot’s WALL-OSS is now integrated into Hugging Face’s LeRobot ecosystem. This is an exciting collaborative project between the LeRobot and X Square Robot teams. You can now post-train, evaluate, and deploy WALL-OSS directly through LeRobot. With this, we’re aiming to make it easier for the open-source robotics community to customize and deploy WALL-OSS foundation models. Read and explore WALL-OSS [paper](https://arxiv.org/pdf/2509.11766) and [code](https://github.com/X-Square-Robot/wall-x).
|
||||
|
||||
## Model Overview
|
||||
|
||||
The WALL-OSS team is building the embodied foundation model to capture and compress the world's most valuable data: the continuous, high-fidelity stream of physical interaction. By creating a direct feedback loop between the model's decisions and the body's lived experience, the emergence of a truly generalizable intelligence is enabled—one that understands not just how the world works, but how to act effectively within it.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/walloss-lerobot-paper.png"
|
||||
alt="An overview of WALL-OSS"
|
||||
width="85%"
|
||||
/>
|
||||
|
||||
Technically, WALL-OSS introduces a tightly coupled multimodal architecture (tightly-coupled MoE structure) that integrates both discrete and continuous action modeling strategies. Through a two-stage training pipeline (Inspiration → Integration), the model gradually unifies semantic reasoning and high-frequency action generation. Its core innovations include:
|
||||
|
||||
- **Embodied perception–enhanced multimodal pretraining**: Large-scale training on unified vision–language–action data to strengthen spatial, causal, and manipulation understanding.
|
||||
- **Unified Cross-Level Chain-of-Thought (Uni-CoT)**: A single differentiable framework that unifies high-level instruction reasoning, sub-task decomposition, and fine-grained action synthesis, forming a continuous chain from “understanding” to “execution.”
|
||||
- **Mixture-of-Experts (MoE) action heads**: Dynamically activating experts depending on the task phase and modeling actions in discrete or continuous space to maintain stable VLM priors.
|
||||
- **Two-stage training paradigm**:
|
||||
- **Inspiration stage**: Injecting discrete action priors to strengthen spatial understanding and semantic-action alignment.
|
||||
- **Integration stage**: Using flow matching to achieve high-frequency continuous control.
|
||||
|
||||
## Installation Requirements
|
||||
|
||||
1. Install LeRobot by following our [Installation Guide](./installation).
|
||||
2. Install WallX dependencies by running:
|
||||
|
||||
```bash
|
||||
pip install -e ".[wallx]"
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
To use WallX in LeRobot, specify the policy type as:
|
||||
|
||||
```python
|
||||
policy.type=wall_x
|
||||
```
|
||||
|
||||
## Training
|
||||
|
||||
For training WallX, you can use the standard LeRobot training script with the appropriate configuration:
|
||||
|
||||
```bash
|
||||
python src/lerobot/scripts/lerobot_train.py \
|
||||
--dataset.repo_id=your_dataset \
|
||||
--policy.type=wall_x \
|
||||
--output_dir=./outputs/wallx_training \
|
||||
--job_name=wallx_training \
|
||||
--policy.repo_id=your_repo_id \
|
||||
--policy.pretrained_name_or_path=x-square-robot/wall-oss-flow \
|
||||
--policy.prediction_mode=diffusion \
|
||||
--policy.attn_implementation=eager \
|
||||
--steps=3000 \
|
||||
--policy.device=cuda \
|
||||
--batch_size=32
|
||||
```
|
||||
|
||||
### Training Arguments
|
||||
|
||||
| Argument                             | Description |
| ------------------------------------ | ----------- |
| `--dataset.repo_id`                  | The Hugging Face Hub repository ID for your training dataset (e.g., `lerobot/aloha_sim_insertion_human`) |
| `--policy.type`                      | Selects the WallX policy architecture |
| `--output_dir`                       | Local directory where training checkpoints and logs will be saved |
| `--job_name`                         | A name identifier for this training run (used in logging/tracking) |
| `--policy.repo_id`                   | Your Hugging Face Hub repo ID where the trained model will be pushed |
| `--policy.pretrained_name_or_path`   | Path to pretrained WallX weights to initialize from (the official WALL-OSS checkpoint) |
| `--policy.prediction_mode`           | The action prediction strategy: `diffusion` uses iterative denoising for action generation, `fast` uses next-token prediction instead |
| `--policy.attn_implementation`       | Attention implementation backend: `eager` uses standard PyTorch attention (alternatives include `flash_attention_2` or `sdpa`) |
| `--steps`                            | Total number of training steps to run |
| `--policy.device`                    | Device to train on (`cuda` for GPU, `cpu` for CPU) |
| `--batch_size`                       | Number of samples per training batch |
|
||||
|
||||
## License
|
||||
|
||||
This model follows the **Apache 2.0 License**, consistent with the original [WallX repository](https://github.com/X-Square-Robot/wall-x).
|
||||
@@ -24,7 +24,7 @@ Built from pure Transformer encoders, X-VLA scales naturally with model size and
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/xvla-architecture2.png"
|
||||
alt="XVLA Architecture 2"
|
||||
style="width: 60%; height: auto;"
|
||||
style="width: 32%; max-width: 450px; height: auto;"
|
||||
/>
|
||||
</p>
|
||||
|
||||
@@ -120,7 +120,7 @@ Adapted for Google Robot platforms.
|
||||
|
||||
### Recommended Training Configuration
|
||||
|
||||
When fine-tuning X-VLA for a new embodiment or task, we recommend leaving the VLM unfrozen and setting `policy.dtype=bfloat16` to avoid out-of-memory (OOM) errors.
|
||||
When fine-tuning X-VLA for a new embodiment or task, we recommend the following freezing strategy:
|
||||
|
||||
```bash
|
||||
lerobot-train \
|
||||
@@ -129,26 +129,25 @@ lerobot-train \
|
||||
--job_name=xvla_training \
|
||||
--policy.path="lerobot/xvla-base" \
|
||||
--policy.repo_id="HF_USER/xvla-your-robot" \
|
||||
--policy.dtype=bfloat16 \
|
||||
--policy.action_mode=auto \
|
||||
--steps=20000 \
|
||||
--steps=3000 \
|
||||
--policy.device=cuda \
|
||||
--policy.freeze_vision_encoder=false \
|
||||
--policy.freeze_language_encoder=false \
|
||||
--policy.train_policy_transformer=true \
|
||||
--policy.train_soft_prompts=true \
|
||||
--policy.freeze_vision_encoder=True \
|
||||
--policy.freeze_language_encoder=True \
|
||||
--policy.train_policy_transformer=True \
|
||||
--policy.train_soft_prompts=True \
|
||||
--policy.action_mode=YOUR_ACTION_MODE
|
||||
```
|
||||
|
||||
### Training Parameters Explained
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| -------------------------- | ------- | ---------------------------------------------- |
|
||||
| `freeze_vision_encoder` | `false` | Do not freeze the VLM vision encoder weights |
|
||||
| `freeze_language_encoder` | `false` | Do not freeze the VLM language encoder weights |
|
||||
| `train_policy_transformer` | `true` | Allow policy transformer layers to train |
|
||||
| `train_soft_prompts` | `true` | Allow soft prompts to train |
|
||||
| Parameter | Default | Description |
|
||||
| -------------------------- | ------- | ---------------------------------------- |
|
||||
| `freeze_vision_encoder` | `True` | Freeze the VLM vision encoder weights |
|
||||
| `freeze_language_encoder` | `True` | Freeze the VLM language encoder weights |
|
||||
| `train_policy_transformer` | `True` | Allow policy transformer layers to train |
|
||||
| `train_soft_prompts` | `True` | Allow soft prompts to train |
|
||||
|
||||
**💡 Best Practice**: For Phase II adaptation to new embodiments, do not freeze the VLM encoders and also train the policy transformer and soft prompts.
|
||||
**💡 Best Practice**: For Phase II adaptation to new embodiments, freeze the VLM encoders and only train the policy transformer and soft prompts. This provides excellent sample efficiency with minimal compute.
|
||||
|
||||
### Example: Training on Bimanual Robot
|
||||
|
||||
@@ -158,15 +157,14 @@ lerobot-train \
|
||||
--output_dir=./outputs/xvla_bimanual \
|
||||
--job_name=xvla_so101_training \
|
||||
--policy.path="lerobot/xvla-base" \
|
||||
--policy.dtype=bfloat16 \
|
||||
--policy.repo_id="YOUR_USERNAME/xvla-biso101" \
|
||||
--steps=3000 \
|
||||
--policy.device=cuda \
|
||||
--policy.action_mode=so101_bimanual \
|
||||
--policy.freeze_vision_encoder=false \
|
||||
--policy.freeze_language_encoder=false \
|
||||
--policy.train_policy_transformer=true \
|
||||
--policy.train_soft_prompts=true
|
||||
--policy.freeze_vision_encoder=True \
|
||||
--policy.freeze_language_encoder=True \
|
||||
--policy.train_policy_transformer=True \
|
||||
--policy.train_soft_prompts=True
|
||||
```
|
||||
|
||||
💡 **Best Performance:** If you have sufficient computational resources and want to achieve best X-VLA finetuning performance, you should follow the official finetuning strategy:
|
||||
@@ -174,7 +172,71 @@ lerobot-train \
|
||||
**🔥 Full-finetune all components with a custom learning-rate scheme**
|
||||
|
||||
To ensure stable optimization, the Vision-Language Model (VLM) must be trained with only 1/10 of the base learning rate, while all other components use the full LR.
|
||||
This LR ratio is crucial for achieving strong and stable finetuning performance. This is already done for you by default.
|
||||
This LR ratio is crucial for achieving strong and stable finetuning performance.
|
||||
To enable this behavior, you must:
|
||||
|
||||
1. Implement a custom optimizer and register it in your training config
|
||||
|
||||
```
from dataclasses import dataclass, asdict
from lerobot.optim.optimizers import OptimizerConfig
import torch

@OptimizerConfig.register_subclass("xvla-adamw")
@dataclass
class XVLAAdamW(OptimizerConfig):
    lr: float = 1e-4
    betas: tuple[float, float] = (0.9, 0.99)
    eps: float = 1e-8
    weight_decay: float = 0.0
    grad_clip_norm: float = 10.0

    def build(self, params: dict) -> torch.optim.Optimizer:
        """
        Expect `named_parameters()` as input.
        Apply lr = lr / 10 for all VLM-related parameters.
        """
        assert isinstance(params, dict), \
            "Custom LR optimizer requires `named_parameters()` as inputs."
        kwargs = asdict(self)
        kwargs.pop("grad_clip_norm")
        vlm_group, other_group = [], []
        for name, p in params.items():
            if not p.requires_grad:
                continue
            if "vlm" in name.lower():
                vlm_group.append(p)
            else:
                other_group.append(p)

        param_groups = [
            {"params": vlm_group, "lr": self.lr * 0.1, "weight_decay": self.weight_decay * 0.1},
            {"params": other_group, "lr": self.lr, "weight_decay": self.weight_decay},
        ]

        return torch.optim.AdamW(param_groups, **kwargs)
```
|
||||
|
||||
2. Modify X-VLA’s get_optim_params to return named parameters
|
||||
|
||||
Replace:
|
||||
|
||||
```
|
||||
def get_optim_params(self) -> dict:
    """Return only trainable parameters for optimization."""
    return filter(lambda p: p.requires_grad, self.parameters())
|
||||
```
|
||||
|
||||
with:
|
||||
|
||||
```
|
||||
def get_optim_params(self) -> dict:
    """Return trainable named parameters as a dict."""
    # Materialize a dict: the custom optimizer's build() asserts isinstance(params, dict).
    return {name: p for name, p in self.named_parameters() if p.requires_grad}
|
||||
```
|
||||
|
||||
This ensures the optimizer receives a dict of named parameters, allowing it to correctly detect VLM modules and apply the 1/10 LR rule.
|
||||
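As a rough illustration of the 1/10 rule, the sketch below maps parameter names to learning rates using the same case-insensitive `"vlm"` name check as the optimizer config. The function name `split_by_lr` and the example parameter names are hypothetical and chosen only for illustration; the real grouping happens inside `XVLAAdamW.build` on actual tensors.

```python
def split_by_lr(param_names, base_lr=1e-4, vlm_scale=0.1):
    """Map each parameter name to the learning rate it would receive.

    Names containing "vlm" (case-insensitive) get base_lr * vlm_scale;
    every other parameter keeps base_lr.
    """
    lrs = {}
    for name in param_names:
        is_vlm = "vlm" in name.lower()
        lrs[name] = base_lr * vlm_scale if is_vlm else base_lr
    return lrs


# Hypothetical parameter names, not taken from the X-VLA source.
names = ["model.vlm.encoder.weight", "model.action_head.weight", "model.soft_prompts"]
for name, lr in split_by_lr(names).items():
    print(f"{name}: lr={lr:.1e}")
```

Because the check is a plain substring match, any module whose name happens to contain "vlm" is placed in the low-LR group, so it is worth printing the grouping once before a long training run.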
|
||||
❕Note
|
||||
|
||||
Completely matching the official reported performance may require an additional warm-up LR schedule for soft-prompts, which can bring minor improvements.
|
||||
@@ -264,26 +326,6 @@ domain_id = 3
|
||||
|
||||
The domain_id is automatically added to observations by the `XVLAAddDomainIdProcessorStep` in the preprocessing pipeline.
|
||||
|
||||
The `lerobot/xvla-base` model has been trained on the following domain IDs. It is recommended to choose one that most resembles your robot/configuration:
|
||||
|
||||
#### Fine-tuning Datasets
|
||||
|
||||
| Dataset Name     | Domain ID |
| ---------------- | --------- |
| Bridge           | 0         |
| RT1              | 1         |
| Calvin           | 2         |
| libero           | 3         |
| widowx-air       | 4         |
| AIR-AGILEX-HQ    | 5         |
| robotwin2_abs_ee | 6         |
| robotwin2_clean  | 6         |
| robocasa-human   | 7         |
| VLABench         | 8         |
| AGIBOT-challenge | 9         |
| AIR-AGILEX       | 10        |
| AIRBOT           | 18        |
|
||||
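For scripting, the table can be kept as a small lookup. This is only a sketch: the dictionary values are copied from the table, while the names `XVLA_DOMAIN_IDS` and `lookup_domain_id` are hypothetical helpers, not part of LeRobot.

```python
# Domain IDs copied from the table; the helper names are illustrative only.
XVLA_DOMAIN_IDS = {
    "Bridge": 0,
    "RT1": 1,
    "Calvin": 2,
    "libero": 3,
    "widowx-air": 4,
    "AIR-AGILEX-HQ": 5,
    "robotwin2_abs_ee": 6,
    "robotwin2_clean": 6,
    "robocasa-human": 7,
    "VLABench": 8,
    "AGIBOT-challenge": 9,
    "AIR-AGILEX": 10,
    "AIRBOT": 18,
}


def lookup_domain_id(dataset_name):
    """Case-insensitive lookup; raises KeyError for unknown dataset names."""
    by_lower = {name.lower(): domain for name, domain in XVLA_DOMAIN_IDS.items()}
    return by_lower[dataset_name.lower()]


print(lookup_domain_id("libero"))
```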
|
||||
### 3. Processor Steps
|
||||
|
||||
X-VLA requires specific preprocessing and postprocessing steps for proper operation.
|
||||
|
||||
@@ -41,7 +41,8 @@ from lerobot.robots import ( # noqa: F401
|
||||
RobotConfig,
|
||||
koch_follower,
|
||||
make_robot_from_config,
|
||||
so_follower,
|
||||
so100_follower,
|
||||
so101_follower,
|
||||
)
|
||||
from lerobot.utils.constants import ACTION
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
@@ -96,7 +97,7 @@ def replay(cfg: ReplayConfig):
|
||||
robot.send_action(action)
|
||||
|
||||
dt_s = time.perf_counter() - start_episode_t
|
||||
precise_sleep(max(1 / dataset.fps - dt_s, 0.0))
|
||||
precise_sleep(1 / dataset.fps - dt_s)
|
||||
|
||||
robot.disconnect()
|
||||
|
||||
|
||||
@@ -0,0 +1,243 @@
|
||||
# Synthetic Data Generation Script - Summary
|
||||
|
||||
## ✅ What Was Created
|
||||
|
||||
### Main Script: `annotate_pgen.py` (717 lines)
|
||||
A production-ready script implementing the Hi-Robot synthetic data generation pipeline.
|
||||
|
||||
**Key Features:**
|
||||
- ✅ Loads LeRobot datasets with skill annotations
|
||||
- ✅ Generates synthetic user prompts and robot utterances using Qwen VLM
|
||||
- ✅ **Temporal sampling** - generates dialogue every N seconds (default: 1s)
|
||||
- ✅ Adds `task_index_high_level` feature to dataset parquets
|
||||
- ✅ Saves high-level tasks to `meta/tasks_high_level.parquet`
|
||||
- ✅ Exports debug JSONL for quality analysis
|
||||
- ✅ Supports both Qwen2-VL and Qwen3-VL models
|
||||
- ✅ Multi-view camera support
|
||||
- ✅ Episode-aware processing with automatic first-frame sampling
|
||||
- ✅ Modular architecture for easy extension
|
||||
|
||||
### Supporting Files Created
|
||||
|
||||
1. **`run_pgen.sh`** - Convenience script with sensible defaults
|
||||
2. **`README_PGEN.md`** - Comprehensive documentation with examples
|
||||
3. **`example_pgen_usage.md`** - Practical examples and performance estimates
|
||||
4. **`SAMPLING_DIAGRAM.md`** - Visual explanation of temporal sampling strategy
|
||||
5. **`PGEN_SUMMARY.md`** - This file
|
||||
|
||||
## 🚀 Key Innovation: Temporal Sampling
|
||||
|
||||
The script processes **ALL episodes** in the dataset efficiently via `--sample-interval`:
|
||||
|
||||
```bash
|
||||
# Instead of calling VLM for every frame (expensive):
|
||||
# 15,000 frames × VLM call = ~5 hours
|
||||
|
||||
# Generate dialogue every 1 second (efficient):
|
||||
python annotate_pgen.py --repo-id dataset --model qwen --sample-interval 1.0
|
||||
# 15,000 frames processed, only ~500 VLM calls (30x speedup!)
|
||||
```
|
||||
|
||||
**How it works:**
|
||||
- Process ALL frames in ALL episodes (complete coverage)
|
||||
- Generate dialogue at sampled timepoints (e.g., every 1 second)
|
||||
- Propagate task indices to intermediate frames
|
||||
- Always sample first frame of each episode
|
||||
- All frames get labeled, but VLM is only called for samples
|
||||
- No dummy values or skipped episodes
|
||||
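The sampling rules listed above can be sketched in a few lines. This is a simplified illustration, not the code in `annotate_pgen.py`: the function name `assign_task_indices` is hypothetical, and it assumes each VLM call yields a new task index (the real script may deduplicate identical dialogues).

```python
def assign_task_indices(frames, sample_interval=1.0):
    """frames: list of (episode_id, timestamp_s) in dataset order.

    Returns (sampled, task_index). sampled[i] is True where the VLM would be
    called; task_index[i] is the index of the most recent sample.
    """
    sampled, task_index = [], []
    last_episode, last_sample_t, current_task = None, None, -1
    eps = 1e-9  # tolerance for float timestamps
    for episode_id, t in frames:
        new_episode = episode_id != last_episode
        due = last_sample_t is not None and t - last_sample_t >= sample_interval - eps
        if new_episode or due:
            current_task += 1  # a new dialogue would be generated here
            last_sample_t = t
            sampled.append(True)
        else:
            sampled.append(False)
        # Intermediate frames inherit the most recent task index.
        task_index.append(current_task)
        last_episode = episode_id
    return sampled, task_index


# Two short episodes with frames every 0.5 s.
frames = [(0, i / 2) for i in range(5)] + [(1, i / 2) for i in range(3)]
sampled, task_index = assign_task_indices(frames, sample_interval=1.0)
print(sampled)
print(task_index)
```

Every frame receives a task index, while the number of `True` entries in `sampled` is the number of VLM calls.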
|
||||
**Benefits:**
|
||||
- Roughly 30x fewer VLM calls at the default 1 s interval on 30 fps data; larger intervals reduce calls further
|
||||
- Maintains temporal coherence
|
||||
- Reduces cost without losing quality
|
||||
- Configurable based on skill duration
|
||||
|
||||
## 📊 Efficiency Comparison
|
||||
|
||||
For a typical 15,000 frame dataset at 30 fps:
|
||||
|
||||
| Method | VLM Calls | Time | Cost |
|--------|-----------|------|------|
| Every frame | 15,000 | ~5 hours | $$$$ |
| Every 0.5s | 1,000 | ~20 min | $$$ |
| **Every 1s** (default) | **500** | **~10 min** | **$$** |
| Every 2s | 250 | ~5 min | $ |
|
||||
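The call counts in the table follow from simple arithmetic: one VLM call per `fps × interval` frames. The sketch below reproduces them under that assumption; `estimate_vlm_calls` is a hypothetical helper, and it ignores the extra call at the first frame of each episode.

```python
def estimate_vlm_calls(total_frames, fps, sample_interval):
    """Approximate number of VLM calls: one per sample_interval seconds of footage."""
    frames_per_sample = max(1, round(fps * sample_interval))
    return total_frames // frames_per_sample


for interval in (0.5, 1.0, 2.0):
    calls = estimate_vlm_calls(15_000, 30, interval)
    ratio = 15_000 / calls
    print(f"interval={interval}s: {calls} calls, {ratio:.0f}x fewer than per-frame")
```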
|
||||
## 🎯 Usage
|
||||
|
||||
### Quick Test (5s sampling for fast iteration)
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 5.0 \
|
||||
--output-dir ./outputs/test_quick
|
||||
```
|
||||
|
||||
### Production Run (Recommended Settings)
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 1.0 \
|
||||
--output-dir ./outputs/full_pgen
|
||||
```
|
||||
|
||||
### High-Quality with Qwen3
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--sample-interval 0.5 \
|
||||
--temperature 0.6 \
|
||||
--output-dir ./outputs/high_quality
|
||||
```
|
||||
|
||||
## 📦 Output Structure
|
||||
|
||||
After running, you'll have:
|
||||
|
||||
```
|
||||
dataset_root/
|
||||
├── meta/
|
||||
│ ├── tasks_high_level.parquet # High-level tasks with prompts/utterances
|
||||
│ └── syn_annotations.jsonl # Debug: full context for each sample
|
||||
└── data/
|
||||
└── chunk-000/
|
||||
└── file-000.parquet # Updated with task_index_high_level
|
||||
```
|
||||
|
||||
**New feature added to all parquet files:**
|
||||
- `task_index_high_level` (int64): Links to tasks_high_level.parquet
|
||||
|
||||
## 🔧 All Parameters
|
||||
|
||||
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--repo-id` / `--data-dir` | - | Dataset source |
| `--model` | Qwen/Qwen2-VL-7B-Instruct | VLM model |
| `--device` | cuda | Device to use |
| `--dtype` | bfloat16 | Model precision |
| `--temperature` | 0.7 | Sampling temperature |
| **`--sample-interval`** | **1.0** | **Generate every N seconds (all episodes processed)** |
| `--num-image-views-per-sample` | 1 | Number of cameras |
| `--batch-size` | 1 | Batch size (currently unused) |
| `--output-dir` | None | Output directory |
| `--push-to-hub` | False | Push to HuggingFace |
|
||||
|
||||
## 🎨 Generated Data Format
|
||||
|
||||
Each sampled frame produces:
|
||||
|
||||
```json
|
||||
{
|
||||
"scenario_type": "specific_object",
|
||||
"response_type": "confirmation",
|
||||
"user_prompt": "Can you pick up the pink brick?",
|
||||
"robot_utterance": "Sure, I'll grab the pink lego brick.",
|
||||
"skill": "robot arm picks up pink lego brick",
|
||||
"episode_id": 0,
|
||||
"frame_index": 45,
|
||||
"timestamp": 1.5,
|
||||
"skill_history": ["robot arm moves towards pink lego brick"],
|
||||
"task_description": "pink lego brick into the transparent box"
|
||||
}
|
||||
```
|
||||
|
||||
**Scenario Types:**
|
||||
- specific_object, negative_task, situated_correction, implicit_request, constraint_based
|
||||
|
||||
**Response Types:**
|
||||
- confirmation, clarification, acknowledgment, constraint_acknowledgment
|
||||
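One way to check the diversity of generated dialogues is to count scenario and response pairs in the debug JSONL. The sketch below assumes each line is a JSON object with the `scenario_type` and `response_type` fields shown in the example record; `summarize_annotations` is a hypothetical helper, not part of the script.

```python
import json
from collections import Counter


def summarize_annotations(lines):
    """Count (scenario_type, response_type) pairs across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[(record["scenario_type"], record["response_type"])] += 1
    return counts


# Inline sample records; with real output, pass an open file handle instead.
sample = [
    '{"scenario_type": "specific_object", "response_type": "confirmation"}',
    '{"scenario_type": "negative_task", "response_type": "acknowledgment"}',
    '{"scenario_type": "specific_object", "response_type": "confirmation"}',
]
for pair, n in summarize_annotations(sample).most_common():
    print(pair, n)
```

A heavily skewed count (for example, almost everything landing in one pair) may suggest adjusting `--temperature` or reviewing the skill annotations.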
|
||||
## 🔬 Code Architecture
|
||||
|
||||
```python
|
||||
# Main components (modular design)

class QwenPgen:
    """VLM wrapper supporting Qwen2/3"""
    def call_qwen(images, prompt) -> dict

def construct_prompt(task, history, skill) -> str:
    """Build contextual prompt with history"""

def annotate_sample(pgen, images, ...) -> dict:
    """Generate dialogue for one sample"""

def generate_synthetic_data(dataset, pgen, ...) -> tuple:
    """Process entire dataset with temporal sampling"""
    # Core sampling logic:
    # - Track last_sample_timestamp per episode
    # - Sample if time_elapsed >= sample_interval
    # - Always sample first frame of episodes
    # - Propagate task_index to intermediate frames

def main():
    """CLI entrypoint with argparse"""
|
||||
```
|
||||
|
||||
## ✨ Next Steps
|
||||
|
||||
1. **Quick test with large interval:**
|
||||
```bash
|
||||
# Fast iteration - samples every 5 seconds
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /path/to/dataset \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 5.0 \
|
||||
--output-dir ./outputs/quick_test
|
||||
```
|
||||
|
||||
2. **Verify output quality:**
|
||||
```bash
|
||||
head outputs/quick_test/meta/syn_annotations.jsonl
|
||||
```
|
||||
|
||||
3. **Production run:**
|
||||
```bash
|
||||
# Standard 1 second sampling for production
|
||||
bash examples/dataset/run_pgen.sh
|
||||
```
|
||||
|
||||
4. **Use in training:**
|
||||
```python
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
ds = LeRobotDataset(repo_id="...", root="outputs/pgen_annotations")
|
||||
|
||||
# Access high-level task for each frame
|
||||
frame = ds[100]
|
||||
task_idx = frame["task_index_high_level"].item()
|
||||
```
|
||||
|
||||
## 📚 Documentation Files
|
||||
|
||||
- **`README_PGEN.md`**: Full API reference and troubleshooting
|
||||
- **`example_pgen_usage.md`**: Practical examples with performance estimates
|
||||
- **`SAMPLING_DIAGRAM.md`**: Visual explanation of temporal sampling
|
||||
- **`PGEN_SUMMARY.md`**: This overview document
|
||||
|
||||
## 🎯 Success Criteria
|
||||
|
||||
✅ Script generates synthetic dialogue using Qwen VLM
|
||||
✅ Adds `task_index_high_level` feature to dataset
|
||||
✅ Saves tasks to `tasks_high_level.parquet`
|
||||
✅ Implements efficient temporal sampling (30-100x speedup)
|
||||
✅ Handles episode boundaries correctly
|
||||
✅ Produces diverse interaction types (scenarios + responses)
|
||||
✅ Maintains temporal coherence within episodes
|
||||
✅ Includes comprehensive documentation and examples
|
||||
✅ Ready for production use on real datasets
|
||||
|
||||
## 💡 Key Takeaway
|
||||
|
||||
**The script processes ALL episodes with intelligent sampling:**
|
||||
- `--sample-interval` controls how often VLM is called (default: 1.0s)
|
||||
- ALL frames in ALL episodes get labeled (complete coverage)
|
||||
- Intermediate frames inherit from most recent sample (temporal coherence)
|
||||
- Achieves 30-100x speedup while maintaining quality
|
||||
- Adjust interval based on use case: 5.0s for testing, 1.0s for production, 0.5s for fine detail
|
||||
|
||||
This makes the synthetic data generation **practical, scalable, and complete** for real-world datasets!
|
||||
|
||||
@@ -0,0 +1,243 @@
|
||||
# Synthetic Data Generation for Hierarchical Robot Policies
|
||||
|
||||
This directory contains `annotate_pgen.py`, a script for generating synthetic user prompts and robot utterances for hierarchical policy training using Vision-Language Models (VLMs).
|
||||
|
||||
## Overview
|
||||
|
||||
The script implements the synthetic data generation pipeline described in the Hi-Robot paper:
|
||||
|
||||
1. **Load** a LeRobot dataset with skill annotations (from `annotate.py`)
|
||||
2. **Generate** synthetic dialogue using Qwen VLM:
|
||||
- User prompts (ℓ_t): Natural requests that lead to specific skills
|
||||
- Robot utterances (u_t): Acknowledgments and clarifications
|
||||
3. **Save** results as a new dataset feature `task_index_high_level`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. First, annotate your dataset with skills using `annotate.py`:
|
||||
|
||||
```bash
|
||||
python examples/dataset/annotate.py \
|
||||
--repo-id lerobot/svla_so101_pickplace \
|
||||
--video-key observation.images.base \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct
|
||||
```
|
||||
|
||||
This creates `meta/skills.json` with skill segmentation for each episode.
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--repo-id lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 1.0 \
|
||||
--output-dir ./outputs/pgen_dataset
|
||||
```
|
||||
|
||||
**Note**: The script processes **all episodes** in the dataset. It generates dialogue every 1 second (`--sample-interval 1.0`) using temporal sampling. Frames between samples reuse the last generated dialogue. This makes the process efficient while ensuring complete dataset coverage.
|
||||
|
||||
### Advanced Options
|
||||
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--repo-id lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--temperature 0.8 \
|
||||
--sample-interval 0.5 \
|
||||
--num-image-views-per-sample 2 \
|
||||
--output-dir ./outputs/pgen_dataset \
|
||||
--push-to-hub
|
||||
```
|
||||
|
||||
This example uses a more powerful model and samples every 0.5 seconds for finer granularity.
|
||||
|
||||
### Fast Testing (larger interval)
|
||||
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--repo-id lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 5.0 \
|
||||
--output-dir ./outputs/pgen_quick_test
|
||||
```
|
||||
|
||||
Use a larger interval (5.0 seconds) for rapid iteration during development. All episodes are still processed.
|
||||
|
||||
### Using Local Dataset
|
||||
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--output-dir ./outputs/pgen_dataset
|
||||
```
|
||||
|
||||
## Output Files
|
||||
|
||||
The script produces several outputs:
|
||||
|
||||
1. **`meta/tasks_high_level.parquet`**: High-level tasks with user prompts and robot utterances
|
||||
- Columns: task_index, user_prompt, robot_utterance, skill, scenario_type, response_type
|
||||
|
||||
2. **`meta/syn_annotations.jsonl`**: Debug file with all generated dialogues
|
||||
- One JSON object per line with full context for each frame
|
||||
|
||||
3. **Modified dataset**: New dataset with `task_index_high_level` feature added to all parquet files
|
||||
|
||||
## Scenario and Response Types
|
||||
|
||||
The generator produces diverse interaction types:
|
||||
|
||||
### Scenario Types
|
||||
- **specific_object**: Direct specification of objects/actions
|
||||
- **negative_task**: Instructions about what NOT to do
|
||||
- **situated_correction**: Adjustments based on current state
|
||||
- **implicit_request**: Implied needs without direct commands
|
||||
- **constraint_based**: Specific constraints or preferences
|
||||
|
||||
### Response Types
|
||||
- **confirmation**: Simple acknowledgment ("OK, I'll do X")
|
||||
- **clarification**: Seeking confirmation ("Just to confirm...")
|
||||
- **acknowledgment**: Action acknowledgment ("Got it, doing X")
|
||||
- **constraint_acknowledgment**: Acknowledging constraints ("Sure, I'll X while Y")
|
||||
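When reviewing generated records, it can help to flag any whose labels fall outside the types listed above. This is a minimal sketch: the two sets are copied from the lists in this section, and `invalid_fields` is a hypothetical helper rather than something the script provides.

```python
SCENARIO_TYPES = {
    "specific_object",
    "negative_task",
    "situated_correction",
    "implicit_request",
    "constraint_based",
}
RESPONSE_TYPES = {
    "confirmation",
    "clarification",
    "acknowledgment",
    "constraint_acknowledgment",
}


def invalid_fields(record):
    """Return the names of label fields whose values are missing or unknown."""
    problems = []
    if record.get("scenario_type") not in SCENARIO_TYPES:
        problems.append("scenario_type")
    if record.get("response_type") not in RESPONSE_TYPES:
        problems.append("response_type")
    return problems


record = {"scenario_type": "specific_object", "response_type": "confirm"}
print(invalid_fields(record))
```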
|
||||
## Example Generated Data
|
||||
|
||||
```json
|
||||
{
|
||||
"episode_id": 0,
|
||||
"frame_index": 45,
|
||||
"timestamp": 2.5,
|
||||
"skill_current": "robot arm picks up pink lego brick",
|
||||
"skill_history": ["robot arm moves towards pink lego brick"],
|
||||
"task_description": "pink lego brick into the transparent box",
|
||||
"scenario_type": "specific_object",
|
||||
"response_type": "confirmation",
|
||||
"user_prompt": "Can you grab the pink brick?",
|
||||
"robot_utterance": "Sure, I'll pick up the pink lego brick."
|
||||
}
|
||||
```
|
||||
|
||||
## Accessing the Data
|
||||
|
||||
After running the script, access the synthetic data in your code:
|
||||
|
||||
```python
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
import pandas as pd
|
||||
|
||||
# Load modified dataset
|
||||
dataset = LeRobotDataset(repo_id="lerobot/svla_so101_pickplace_with_high_level_tasks")
|
||||
|
||||
# Access frame with high-level task
|
||||
frame = dataset[100]
|
||||
high_level_task_idx = frame["task_index_high_level"].item()
|
||||
|
||||
# Load high-level tasks
|
||||
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
|
||||
task_info = tasks_df.iloc[high_level_task_idx]
|
||||
|
||||
print(f"User prompt: {task_info['user_prompt']}")
|
||||
print(f"Robot utterance: {task_info['robot_utterance']}")
|
||||
print(f"Skill: {task_info['skill']}")
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
The script is modular and extensible:
|
||||
|
||||
```python
|
||||
# Core components
class QwenPgen:
    """VLM wrapper for generation"""
    def call_qwen(images, prompt) -> dict

def construct_prompt(task, history, skill) -> str
    """Build prompt for VLM"""

def annotate_sample(pgen, images, ...) -> dict
    """Generate dialogue for one sample"""

def generate_synthetic_data(dataset, pgen, ...) -> tuple
    """Process entire dataset"""
|
||||
```
|
||||
|
||||
## Parameters
|
||||
|
||||
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--repo-id` | - | HuggingFace dataset ID |
| `--data-dir` | - | Local dataset path |
| `--model` | Qwen/Qwen2-VL-7B-Instruct | VLM model name |
| `--device` | cuda | Device (cuda/cpu) |
| `--dtype` | bfloat16 | Model precision |
| `--temperature` | 0.7 | Sampling temperature |
| `--sample-interval` | 1.0 | Generate dialogue every N seconds (all episodes processed) |
| `--num-image-views-per-sample` | 1 | Number of cameras |
| `--output-dir` | None | Output directory |
| `--push-to-hub` | False | Push to HuggingFace Hub |
|
||||
|
||||
## Sampling Strategy
|
||||
|
||||
The script uses **temporal sampling** to efficiently generate dialogue:
|
||||
|
||||
- **Default**: Generate dialogue every 1 second (`--sample-interval 1.0`)
|
||||
- **Efficiency**: If a dataset runs at 30fps, this samples ~3% of frames
|
||||
- **Propagation**: Frames between samples reuse the last generated task_index
|
||||
- **Episode-aware**: Always samples the first frame of each episode
|
||||
|
||||
### Example with 30 fps dataset:
|
||||
```bash
|
||||
# Sample every 1 second (every 30 frames)
|
||||
--sample-interval 1.0 # ~3,000 generations for a 100 episode dataset (30 sec/episode)
|
||||
|
||||
# Sample every 0.5 seconds (every 15 frames)
|
||||
--sample-interval 0.5 # ~6,000 generations (more granular)
|
||||
|
||||
# Sample every 2 seconds (every 60 frames)
|
||||
--sample-interval 2.0 # ~1,500 generations (more efficient)
|
||||
```
|
||||
|
||||
### Why sampling works:
|
||||
- Skills typically last 1-3 seconds
|
||||
- Dialogue doesn't need to change every frame
|
||||
- Reduces computational cost by 30-100x
|
||||
- Still provides good coverage for training
|
||||
|
||||
## Tips
|
||||
|
||||
1. **Quick testing**: Use larger `--sample-interval` (e.g., 5.0 or 10.0) for rapid iteration
|
||||
2. **Monitor GPU**: VLM inference is memory-intensive
|
||||
3. **Check outputs**: Review `syn_annotations.jsonl` for quality
|
||||
4. **Adjust temperature**: Higher = more diverse, lower = more consistent
|
||||
5. **Multiple views**: Use `--num-image-views-per-sample 2+` for better context
|
||||
6. **Tune sampling**: Start with 1.0s, increase for speed (testing), decrease for granularity (production)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### No skills.json found
|
||||
Run `annotate.py` first to generate skill annotations.
|
||||
|
||||
### Out of memory
|
||||
- Reduce batch size to 1
|
||||
- Use smaller model (Qwen2-VL-7B instead of Qwen3-VL-30B)
|
||||
- Process fewer samples at a time
|
||||
|
||||
### Poor quality generations
|
||||
- Adjust temperature (try 0.6-0.9)
|
||||
- Check that skills.json has good annotations
|
||||
- Ensure images are loading correctly
|
||||
|
||||
## Citation
|
||||
|
||||
Based on the Hi-Robot paper's synthetic data generation approach:
|
||||
```
|
||||
@article{hirobot2025,
  title={Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models},
  year={2025}
}
|
||||
```
|
||||
|
||||
@@ -0,0 +1,141 @@
|
||||
# Temporal Sampling Strategy Visualization
|
||||
|
||||
## How `--sample-interval` Works
|
||||
|
||||
### Example: 30 fps dataset, `--sample-interval 1.0` (1 second)
|
||||
|
||||
```
|
||||
Timeline (s):  0.0    0.5    1.0    1.5    2.0    2.5    3.0
Frame:         0      15     30     45     60     75     90
Sampled:       YES    NO     YES    NO     YES    NO     YES
Task index:    [0]──────────>[1]──────────>[2]──────────>[3]
VLM called:    ✓ Gen         ✓ Gen         ✓ Gen         ✓ Gen
               dialogue      dialogue      dialogue      dialogue

Frames 0-29   get task 0
Frames 30-59  get task 1
Frames 60-89  get task 2
Frames 90-119 get task 3
|
||||
```
|
||||
|
||||
## Comparison: Different Sampling Intervals
|
||||
|
||||
### `--sample-interval 2.0` (every 2 seconds)
|
||||
```
|
||||
Timeline: 0.0 1.0 2.0 3.0 4.0 5.0 6.0
|
||||
│ │ │ │ │ │ │
|
||||
Sampled: YES NO YES NO YES NO YES
|
||||
│ │ │ │
|
||||
Tasks: [0]───────────────>[1]───────────────>[2]───────────────>[3]
|
||||
|
||||
VLM Calls: 4 (fewer calls, faster but less granular)
|
||||
```
|
||||
|
||||
### `--sample-interval 1.0` (every 1 second) - **DEFAULT**
|
||||
```
|
||||
Timeline: 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Sampled: YES NO YES NO YES NO YES NO YES NO YES NO YES
|
||||
│ │ │ │ │ │ │
|
||||
Tasks: [0]─────────>[1]─────────>[2]─────────>[3]─────────>[4]─────────>[5]─────>[6]
|
||||
|
||||
VLM Calls: 7 (balanced coverage and speed)
|
||||
```
|
||||
|
||||
### `--sample-interval 0.5` (every 0.5 seconds)
|
||||
```
|
||||
Timeline: 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Sampled: YES YES YES YES YES YES YES YES YES YES YES YES YES
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Tasks: [0]─>[1]─>[2]─>[3]─>[4]─>[5]─>[6]─>[7]─>[8]─>[9]─>[10]>[11]>[12]
|
||||
|
||||
VLM Calls: 13 (high granularity, slower but more detailed)
|
||||
```
|
||||
|
||||
## Episode Boundaries
|
||||
|
||||
The script always samples the **first frame** of each episode:
|
||||
|
||||
```
|
||||
Episode 0 Episode 1 Episode 2
|
||||
├─────────────────────────────────┤├─────────────────────────────────┤├──────...
|
||||
│ ││ ││
|
||||
Frame: 0 30 60 90 120 130 160 190 220 250 260 290 320
|
||||
Time: 0.0 1.0 2.0 3.0 4.0 0.0 1.0 2.0 3.0 4.0 0.0 1.0 2.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
|
||||
Sample:YES YES YES YES YES YES YES YES YES YES YES YES YES
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Task: 0────1─────2─────3────4 5─────6─────7─────8────9 10────11───12
|
||||
|
||||
Note: Frames 0, 130, 260 are ALWAYS sampled (episode starts)
|
||||
Even if they're within the sample-interval window
|
||||
```
|
||||
|
||||
## Real-World Example: svla_so101_pickplace Dataset
|
||||
|
||||
Typical stats:
|
||||
- **Total episodes**: 50
|
||||
- **Avg episode length**: 300 frames (10 seconds at 30 fps)
|
||||
- **Total frames**: 15,000
|
||||
|
||||
### Without Sampling (every frame)
|
||||
```
|
||||
Frames processed: 15,000
|
||||
VLM calls: 15,000
|
||||
Time estimate: ~5 hours
|
||||
Unique tasks: ~12,000 (lots of duplicates)
|
||||
```
|
||||
|
||||
### With `--sample-interval 1.0` (every 1 second)
|
||||
```
|
||||
Frames processed: 15,000 ✓
|
||||
VLM calls: 500
|
||||
Time estimate: ~10 minutes
|
||||
Unique tasks: ~450 (meaningful variety)
|
||||
Efficiency gain: 30x faster
|
||||
```
|
||||
|
||||
### With `--sample-interval 2.0` (every 2 seconds)
|
||||
```
|
||||
Frames processed: 15,000 ✓
|
||||
VLM calls: 250
|
||||
Time estimate: ~5 minutes
|
||||
Unique tasks: ~220
|
||||
Efficiency gain: 60x faster
|
||||
```
|
||||
|
||||
## Key Points
|
||||
|
||||
1. **All frames get labeled**: Every frame gets a `task_index_high_level`
|
||||
2. **Only sampled frames call VLM**: Huge efficiency gain
|
||||
3. **Temporal coherence**: Nearby frames share the same task
|
||||
4. **Episode-aware**: Always samples episode starts
|
||||
5. **Configurable**: Adjust `--sample-interval` based on your needs
|
||||
|
||||
## Choosing Your Sampling Interval
|
||||
|
||||
| Use Case | Recommended Interval | Why |
|----------|---------------------|-----|
| Quick testing | 2.0s | Fastest iteration |
| Standard training | 1.0s | Good balance |
| High-quality dataset | 0.5s | Better coverage |
| Fine-grained control | 0.33s | Very detailed |
| Dense annotations | 0.1s | Nearly every frame |
|
||||
|
||||
**Rule of thumb**: Match your sampling interval to your typical skill duration.
|
||||
If skills last 1-3 seconds, sampling every 1 second captures each skill multiple times.
|
||||
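The rule of thumb can be checked numerically. The sketch below gives a lower bound on how many samples land inside a single skill, assuming samples are evenly spaced and the skill can start at any offset; `samples_per_skill` is a hypothetical helper for illustration.

```python
import math


def samples_per_skill(skill_duration_s, sample_interval_s):
    """Lower bound on the number of evenly spaced samples inside one skill.

    The actual count is this value or one more, depending on where the
    skill starts relative to the sampling grid.
    """
    return math.floor(skill_duration_s / sample_interval_s)


for duration in (1.0, 2.0, 3.0):
    for interval in (0.5, 1.0, 2.0):
        n = samples_per_skill(duration, interval)
        print(f"skill={duration}s interval={interval}s -> at least {n} sample(s)")
```

A result of 0 means a skill of that length can fall entirely between two samples, which is a sign that the interval is too coarse for the dataset.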
|
||||
@@ -0,0 +1,138 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
"""
|
||||
Example demonstrating how to use the ActionTokenizerProcessorStep to tokenize actions.
|
||||
|
||||
This example shows how to:
|
||||
1. Load a dataset with action data
|
||||
2. Apply the action tokenizer processor to tokenize actions with proper padding/truncation
|
||||
3. Access both the tokenized actions and the attention mask
|
||||
4. Decode tokenized actions back to their original form
|
||||
"""
|
||||
|
||||
import torch
|
||||
from transformers import AutoProcessor
|
||||
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
from lerobot.processor.core import EnvTransition, TransitionKey
|
||||
from lerobot.processor.tokenizer_processor import ActionTokenizerProcessorStep
|
||||
from lerobot.utils.constants import ACTION_TOKEN_MASK
|
||||
|
||||
# Define delta timestamps for the dataset
|
||||
delta_timestamps = {
|
||||
'action': [
|
||||
0.0, 0.03333333333333333, 0.06666666666666667, 0.1, 0.13333333333333333,
|
||||
0.16666666666666666, 0.2, 0.23333333333333334, 0.26666666666666666, 0.3,
|
||||
0.3333333333333333, 0.36666666666666664, 0.4, 0.43333333333333335,
|
||||
0.4666666666666667, 0.5, 0.5333333333333333, 0.5666666666666667, 0.6,
|
||||
0.6333333333333333, 0.6666666666666666, 0.7, 0.7333333333333333,
|
||||
0.7666666666666667, 0.8, 0.8333333333333334, 0.8666666666666667, 0.9,
|
||||
0.9333333333333333, 0.9666666666666667, 1.0, 1.0333333333333334,
|
||||
1.0666666666666667, 1.1, 1.1333333333333333, 1.1666666666666667, 1.2,
|
||||
1.2333333333333334, 1.2666666666666666, 1.3, 1.3333333333333333,
|
||||
1.3666666666666667, 1.4, 1.4333333333333333, 1.4666666666666666, 1.5,
|
||||
1.5333333333333334, 1.5666666666666667, 1.6, 1.6333333333333333
|
||||
]
|
||||
}
|
||||
|
||||
# Load the dataset
|
||||
print("Loading dataset...")
|
||||
dataset = LeRobotDataset(
|
||||
repo_id="local",
|
||||
root="/fsx/jade_choghari/outputs/pgen_annotations1",
|
||||
delta_timestamps=delta_timestamps
|
||||
)
|
||||
|
||||
# Create a dataloader
|
||||
dataloader = torch.utils.data.DataLoader(
|
||||
dataset,
|
||||
num_workers=0,
|
||||
batch_size=4,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
# Get a batch of data
|
||||
batch = next(iter(dataloader))
|
||||
action_data = batch["action"] # Shape: (batch_size, action_horizon, action_dim)
|
||||
|
||||
print(f"\nOriginal action shape: {action_data.shape}")
|
||||
print(f"Original action data (first sample, first timestep):\n{action_data[0, 0]}")
|
||||
|
||||
# Method 1: Using the tokenizer directly (as in fast_tokenize.py)
|
||||
print("\n" + "="*80)
|
||||
print("Method 1: Direct tokenizer usage")
|
||||
print("="*80)
|
||||
|
||||
tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
|
||||
|
||||
# Tokenize directly
|
||||
tokens = tokenizer(action_data)
|
||||
print(f"\nDirect tokenization result type: {type(tokens)}")
|
||||
print(f"Tokens shape/length: {tokens.shape if isinstance(tokens, torch.Tensor) else len(tokens)}")
|
||||
|
||||
# Decode
|
||||
decoded_actions = tokenizer.decode(tokens)
|
||||
print(f"Decoded actions shape: {decoded_actions.shape}")
|
||||
reconstruction_error = torch.abs(action_data - decoded_actions).mean()
|
||||
print(f"Mean absolute reconstruction error: {reconstruction_error.item():.6f}")
|
||||
|
||||
# Method 2: Using the ActionTokenizerProcessorStep with proper padding/truncation
|
||||
print("\n" + "="*80)
|
||||
print("Method 2: Using ActionTokenizerProcessorStep (with padding & mask)")
|
||||
print("="*80)
|
||||
|
||||
# Create the action tokenizer processor step
|
||||
action_tokenizer_processor = ActionTokenizerProcessorStep(
|
||||
tokenizer_name="physical-intelligence/fast",
|
||||
trust_remote_code=True,
|
||||
max_action_tokens=32, # Maximum number of tokens per action
|
||||
)
|
||||
|
||||
# Create a transition with the action data
|
||||
transition = {
|
||||
TransitionKey.ACTION: action_data,
|
||||
TransitionKey.OBSERVATION: {}, # Empty for this example
|
||||
}
|
||||
|
||||
# Apply the processor
|
||||
processed_transition = action_tokenizer_processor(transition)
|
||||
|
||||
# Extract tokenized actions and mask
|
||||
tokenized_actions = processed_transition[TransitionKey.ACTION]
|
||||
complementary_data = processed_transition[TransitionKey.COMPLEMENTARY_DATA]
|
||||
action_mask = complementary_data[ACTION_TOKEN_MASK]
|
||||
|
||||
print(f"\nTokenized actions shape: {tokenized_actions.shape}") # (batch_size, max_action_tokens)
|
||||
print(f"Action mask shape: {action_mask.shape}") # (batch_size, max_action_tokens)
|
||||
print(f"Tokenized actions dtype: {tokenized_actions.dtype}")
|
||||
print(f"Action mask dtype: {action_mask.dtype}")
|
||||
|
||||
# Show token statistics
|
||||
print(f"\nFirst sample tokens: {tokenized_actions[0]}")
|
||||
print(f"First sample mask: {action_mask[0]}")
|
||||
num_real_tokens = action_mask[0].sum().item()
|
||||
print(f"Number of real tokens (non-padding): {num_real_tokens}")
|
||||
print(f"Number of padding tokens: {action_mask.shape[1] - num_real_tokens}")
|
||||
|
||||
# Decode the padded token sequences (note: the mask itself is not applied here)
|
||||
print("\nDecoding tokenized actions...")
|
||||
decoded_with_processor = tokenizer.decode(tokenized_actions)
|
||||
print(f"Decoded actions shape: {decoded_with_processor.shape}")
|
||||
|
||||
# Calculate reconstruction error
|
||||
reconstruction_error_processor = torch.abs(action_data - decoded_with_processor).mean()
|
||||
print(f"Mean absolute reconstruction error: {reconstruction_error_processor.item():.6f}")
|
||||
|
||||
# Show that masking works correctly
|
||||
print("\n" + "="*80)
|
||||
print("Mask demonstration")
|
||||
print("="*80)
|
||||
for i in range(min(4, tokenized_actions.shape[0])):
|
||||
mask_i = action_mask[i]
|
||||
num_real = mask_i.sum().item()
|
||||
print(f"Sample {i}: {num_real} real tokens, {len(mask_i) - num_real} padding tokens")
|
||||
|
||||
print("\n" + "="*80)
|
||||
print("Action tokenization example completed successfully!")
|
||||
print("="*80)
|
||||
|
||||
@@ -0,0 +1,143 @@
|
||||
# Example: Synthetic Data Generation with Sampling
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Test with 100 frames and 1 second sampling
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--num-samples 100 \
|
||||
--sample-interval 1.0 \
|
||||
--output-dir ./outputs/test_pgen
|
||||
```
|
||||
|
||||
**Expected behavior** (assuming 30 fps):
|
||||
- Total frames: 100
|
||||
- Frames sampled: ~4 (every 30 frames = 1 second)
|
||||
- Efficiency: 96% fewer VLM calls
|
||||
- Output: All 100 frames get `task_index_high_level`, but only 4 unique dialogues generated
|
||||
|
||||
### 2. Process full dataset with different sampling rates
|
||||
|
||||
#### Conservative (every 2 seconds)
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 2.0 \
|
||||
--output-dir ./outputs/pgen_2s
|
||||
```
|
||||
|
||||
#### Standard (every 1 second) - **RECOMMENDED**
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 1.0 \
|
||||
--output-dir ./outputs/pgen_1s
|
||||
```
|
||||
|
||||
#### Fine-grained (every 0.5 seconds)
|
||||
```bash
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--sample-interval 0.5 \
|
||||
--output-dir ./outputs/pgen_0.5s
|
||||
```
|
||||
|
||||
## Performance Estimates
|
||||
|
||||
For a dataset with:
|
||||
- 100 episodes
|
||||
- 10 seconds per episode (average)
|
||||
- 30 fps
|
||||
- Total frames: 30,000
|
||||
|
||||
| Sampling Interval | Frames Sampled | % Sampled | Speedup | Time Estimate |
|
||||
|-------------------|----------------|-----------|---------|---------------|
|
||||
| Every frame (0.033s) | 30,000 | 100% | 1x | ~10 hours |
|
||||
| 0.5 seconds | 2,000 | 6.7% | 15x | ~40 min |
|
||||
| **1.0 seconds** | **1,000** | **3.3%** | **30x** | **~20 min** |
|
||||
| 2.0 seconds | 500 | 1.7% | 60x | ~10 min |
|
||||
|
||||
*Note: Times are approximate and depend on GPU, model size, and generation speed*
|
||||
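The table's arithmetic can be reproduced with a few lines of Python. This is only a sketch: `estimate_sampling` is a hypothetical helper, not part of `annotate_pgen.py`, and it assumes sampling on a fixed frame stride of `round(fps * sample_interval)` over one contiguous run of frames.

```python
def estimate_sampling(total_frames, fps, sample_interval):
    """Estimate how many frames trigger a VLM call and the resulting speedup."""
    stride = max(1, round(fps * sample_interval))  # frames between two VLM calls
    sampled = -(-total_frames // stride)  # ceiling division
    return sampled, 100.0 * sampled / total_frames, total_frames / sampled


for interval in (0.5, 1.0, 2.0):
    sampled, percent, speedup = estimate_sampling(30_000, fps=30, sample_interval=interval)
    print(f"{interval:.1f}s: {sampled} frames ({percent:.1f}%), about {speedup:.0f}x fewer VLM calls")
```

Wall-clock time scales roughly with the number of sampled frames, so the time column follows from the first row divided by the speedup.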
|
||||
## Understanding the Output
|
||||
|
||||
### Console Output Example
|
||||
```
|
||||
[cyan]Generating synthetic data for 30000 frames...[/cyan]
|
||||
[cyan]Sampling interval: 1.0s (fps: 30)[/cyan]
|
||||
Generating synthetic dialogue: 100%|████████| 30000/30000 [20:15<00:00, 24.68it/s]
|
||||
[green]✓ Sampled 1000 frames out of 30000 (3.3%)[/green]
|
||||
[green]✓ Generated 450 unique high-level tasks[/green]
|
||||
```
|
||||
|
||||
### What happens:
|
||||
1. **Frame 0 (t=0.0s)**: Generate dialogue → Task index 0
|
||||
2. **Frames 1-29 (t=0.033s-0.967s)**: Reuse task index 0
|
||||
3. **Frame 30 (t=1.0s)**: Generate new dialogue → Task index 1
|
||||
4. **Frames 31-59 (t=1.033s-1.967s)**: Reuse task index 1
|
||||
5. And so on...
|
||||
|
||||
### Result:
|
||||
- Every frame has a `task_index_high_level`
|
||||
- Only sampled frames have unique dialogues generated
|
||||
- Intermediate frames inherit from the most recent sample
|
||||
- Maintains temporal coherence within episodes
|
||||
|
||||
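The propagation described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `assign_task_indices` is a hypothetical name, it handles a single episode, and it gives every sampled frame a fresh index (the real script may deduplicate identical dialogues, which is why the console output reports fewer unique tasks than sampled frames).

```python
def assign_task_indices(num_frames, fps, sample_interval):
    """Return (task index per frame, list of sampled frame indices)."""
    stride = max(1, round(fps * sample_interval))
    task_index = []
    sampled = []
    for frame in range(num_frames):
        if frame % stride == 0:
            sampled.append(frame)  # a new dialogue would be generated here
        task_index.append(len(sampled) - 1)  # reuse the most recent sample
    return task_index, sampled


task_index, sampled = assign_task_indices(num_frames=100, fps=30, sample_interval=1.0)
print(sampled)
print(task_index[29], task_index[30])
```

With 100 frames at 30 fps and a 1 second interval, frames 0, 30, 60 and 90 are sampled, and every frame in between inherits the index of the sample before it.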
## Checking Your Results
|
||||
|
||||
After running, verify the output:
|
||||
|
||||
```bash
|
||||
# Check the generated tasks
|
||||
python -c "
|
||||
import pandas as pd
|
||||
from pathlib import Path
|
||||
|
||||
tasks = pd.read_parquet('outputs/test_pgen/meta/tasks_high_level.parquet')
|
||||
print(f'Total unique tasks: {len(tasks)}')
|
||||
print('Sample tasks:')
|
||||
print(tasks[['user_prompt', 'robot_utterance', 'skill']].head())
|
||||
"
|
||||
|
||||
# Check debug output
|
||||
head outputs/test_pgen/meta/syn_annotations.jsonl
|
||||
|
||||
# Load and verify dataset
|
||||
python -c "
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
ds = LeRobotDataset(repo_id='local_with_high_level_tasks',
|
||||
root='outputs/test_pgen')
|
||||
print(f'Dataset has {len(ds)} frames')
|
||||
print(f'Features: {list(ds.features.keys())}')
|
||||
assert 'task_index_high_level' in ds.features
|
||||
print('✓ task_index_high_level feature added successfully!')
|
||||
"
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Development/Testing
|
||||
```bash
|
||||
--sample-interval 2.0 # Fast iteration
|
||||
--num-samples 500 # Small subset
|
||||
```
|
||||
|
||||
### Production Training
|
||||
```bash
|
||||
--sample-interval 1.0 # Good coverage
|
||||
# Process all samples (no --num-samples)
|
||||
```
|
||||
|
||||
### High-Quality Dataset
|
||||
```bash
|
||||
--sample-interval 0.5 # Fine-grained
|
||||
--temperature 0.6 # More consistent
|
||||
--model Qwen/Qwen3-VL-30B-A3B-Instruct # Larger model
|
||||
```
|
||||
|
||||
@@ -0,0 +1,25 @@
|
||||
import numpy as np
|
||||
from transformers import AutoProcessor
|
||||
import torch
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
|
||||
|
||||
delta_timestamps = {'action': [0.0, 0.03333333333333333, 0.06666666666666667, 0.1, 0.13333333333333333, 0.16666666666666666, 0.2, 0.23333333333333334, 0.26666666666666666, 0.3, 0.3333333333333333, 0.36666666666666664, 0.4, 0.43333333333333335, 0.4666666666666667, 0.5, 0.5333333333333333, 0.5666666666666667, 0.6, 0.6333333333333333, 0.6666666666666666, 0.7, 0.7333333333333333, 0.7666666666666667, 0.8, 0.8333333333333334, 0.8666666666666667, 0.9, 0.9333333333333333, 0.9666666666666667, 1.0, 1.0333333333333334, 1.0666666666666667, 1.1, 1.1333333333333333, 1.1666666666666667, 1.2, 1.2333333333333334, 1.2666666666666666, 1.3, 1.3333333333333333, 1.3666666666666667, 1.4, 1.4333333333333333, 1.4666666666666666, 1.5, 1.5333333333333334, 1.5666666666666667, 1.6, 1.6333333333333333]}
|
||||
dataset = LeRobotDataset(repo_id="local", root="/fsx/jade_choghari/outputs/pgen_annotations1", delta_timestamps=delta_timestamps)
|
||||
|
||||
dataloader = torch.utils.data.DataLoader(
|
||||
dataset,
|
||||
num_workers=0,
|
||||
batch_size=4,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
batch = next(iter(dataloader))
|
||||
|
||||
# Load the tokenizer from the Hugging Face hub
|
||||
tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
|
||||
|
||||
# Tokenize & decode action chunks (here, one batch of actions from the dataset)
|
||||
action_data = batch["action"] # one batch of action chunks
|
||||
tokens = tokenizer(action_data) # tokens = list[int]
|
||||
decoded_actions = tokenizer.decode(tokens)
|
||||
print("tokenized actions: ", tokens)
|
||||
@@ -0,0 +1,17 @@
|
||||
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
|
||||
|
||||
model_id = "google/paligemma-3b-pt-224"
|
||||
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
|
||||
breakpoint()
|
||||
prefix_output = model.language_model.forward(
|
||||
inputs_embeds=inputs_embeds[0],
|
||||
attention_mask=attention_mask,
|
||||
position_ids=position_ids,
|
||||
adarms_cond=adarms_cond[0] if adarms_cond is not None else None,
|
||||
)
|
||||
prefix_past_key_values = prefix_output.past_key_values
|
||||
# prefix_output to be used for the language head
|
||||
# shape: [batch_size, seq_len, hidden_size] with hidden_size = 2048
|
||||
prefix_output = prefix_output.last_hidden_state
|
||||
@@ -0,0 +1,91 @@
|
||||
import torch
|
||||
from huggingface_hub import HfApi
|
||||
|
||||
import lerobot
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
|
||||
# import make_pre_post_processors
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.policies.pi05.configuration_pi05 import PI05Config
|
||||
from lerobot.policies.factory import make_policy, make_policy_config
|
||||
from lerobot.configs.policies import PreTrainedConfig
|
||||
|
||||
cfg = PreTrainedConfig.from_pretrained(
|
||||
pretrained_name_or_path="/fsx/jade_choghari/outputs/pi0_training/checkpoints/last/pretrained_model",
|
||||
)
|
||||
cfg.dtype = "bfloat16"
|
||||
|
||||
pre_processor, post_processor = make_pre_post_processors(
|
||||
policy_cfg=cfg,
|
||||
pretrained_path="/fsx/jade_choghari/outputs/pi0_training/checkpoints/last/pretrained_model",
|
||||
)
|
||||
|
||||
delta_timestamps = {'action': [0.0, 0.03333333333333333, 0.06666666666666667, 0.1, 0.13333333333333333, 0.16666666666666666, 0.2, 0.23333333333333334, 0.26666666666666666, 0.3, 0.3333333333333333, 0.36666666666666664, 0.4, 0.43333333333333335, 0.4666666666666667, 0.5, 0.5333333333333333, 0.5666666666666667, 0.6, 0.6333333333333333, 0.6666666666666666, 0.7, 0.7333333333333333, 0.7666666666666667, 0.8, 0.8333333333333334, 0.8666666666666667, 0.9, 0.9333333333333333, 0.9666666666666667, 1.0, 1.0333333333333334, 1.0666666666666667, 1.1, 1.1333333333333333, 1.1666666666666667, 1.2, 1.2333333333333334, 1.2666666666666666, 1.3, 1.3333333333333333, 1.3666666666666667, 1.4, 1.4333333333333333, 1.4666666666666666, 1.5, 1.5333333333333334, 1.5666666666666667, 1.6, 1.6333333333333333]}
|
||||
|
||||
dataset = LeRobotDataset(repo_id="local", root="/fsx/jade_choghari/outputs/pgen_annotations1", delta_timestamps=delta_timestamps)
|
||||
|
||||
# rename map --rename_map='{
|
||||
# "observation.images.side": "observation.images.base_0_rgb",
|
||||
# "observation.images.up": "observation.images.left_wrist_0_rgb"
|
||||
# }'
|
||||
rename_map = {
|
||||
"observation.images.side": "observation.images.base_0_rgb",
|
||||
"observation.images.up": "observation.images.left_wrist_0_rgb"
|
||||
}
|
||||
policy = make_policy(
|
||||
cfg=cfg,
|
||||
ds_meta=dataset.meta,
|
||||
rename_map=rename_map,
|
||||
)
|
||||
|
||||
dataloader = torch.utils.data.DataLoader(
|
||||
dataset,
|
||||
num_workers=0,
|
||||
batch_size=4,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
batch = next(iter(dataloader))
|
||||
batch = pre_processor(batch)
|
||||
policy.train()
|
||||
# run a training forward pass (the inference call is kept below for reference)
|
||||
# action = policy.select_action(batch)
|
||||
loss, loss_dict = policy.forward(batch)
|
||||
breakpoint()
|
||||
# import requests
|
||||
# from PIL import Image
|
||||
# from transformers import AutoProcessor
|
||||
# model = policy.model.paligemma_with_expert.paligemma
|
||||
# model = model.to(device="cuda", dtype=torch.bfloat16)
|
||||
# model.eval()
|
||||
# prompt = "Describe this image."
|
||||
# url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
# image = Image.open(requests.get(url, stream=True).raw)
|
||||
# processor = AutoProcessor.from_pretrained(
|
||||
# "google/paligemma-3b-pt-224",
|
||||
# )
|
||||
# inputs = processor(image, prompt, return_tensors="pt").to(model.device)
|
||||
# print("generating...")
|
||||
# output = model.generate(
|
||||
# **inputs,
|
||||
# max_new_tokens=50,
|
||||
# use_cache=True, # default dynamic cache
|
||||
# )
|
||||
# print(processor.decode(output[0], skip_special_tokens=True))
|
||||
|
||||
|
||||
# # other model
|
||||
# from transformers import PaliGemmaForConditionalGeneration
|
||||
# model = PaliGemmaForConditionalGeneration.from_pretrained(
|
||||
# "google/paligemma2-3b-pt-224",
|
||||
# torch_dtype=torch.bfloat16,
|
||||
# device_map="auto",
|
||||
# )
|
||||
# model.eval()
|
||||
# print("generating...")
|
||||
# output = model.generate(
|
||||
# **inputs,
|
||||
# max_new_tokens=100,
|
||||
# use_cache=True, # default dynamic cache
|
||||
# )
|
||||
# print("Model 2 output:")
|
||||
# print(processor.decode(output[0], skip_special_tokens=True))
|
||||
@@ -0,0 +1,23 @@
|
||||
import torch
|
||||
from huggingface_hub import HfApi
|
||||
|
||||
import lerobot
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
|
||||
|
||||
dataset = LeRobotDataset(repo_id="local", root="/fsx/jade_choghari/outputs/pgen_annotations1")
|
||||
|
||||
dataloader = torch.utils.data.DataLoader(
|
||||
dataset,
|
||||
num_workers=0,
|
||||
batch_size=32,
|
||||
shuffle=True,
|
||||
)
|
||||
|
||||
batch = next(iter(dataloader))
|
||||
print(batch.keys())
|
||||
print(batch['task_index_high_level'].shape)
|
||||
print(batch['task_index_high_level'])
|
||||
print(batch['user_prompt'][0])
|
||||
print(batch['robot_utterance'][0])
|
||||
print(batch['task'][0])
|
||||
breakpoint()
|
||||
@@ -0,0 +1,18 @@
|
||||
import torch
|
||||
from huggingface_hub import HfApi
|
||||
|
||||
import lerobot
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
|
||||
|
||||
dataset = LeRobotDataset(repo_id="lerobot/libero")
|
||||
|
||||
dataloader = torch.utils.data.DataLoader(
|
||||
dataset,
|
||||
num_workers=0,
|
||||
batch_size=4,
|
||||
shuffle=True,
|
||||
)
|
||||
batch = next(iter(dataloader))
|
||||
print(batch.keys())
|
||||
|
||||
breakpoint()
|
||||
@@ -0,0 +1,159 @@
|
||||
## One-sentence answer
|
||||
|
||||
> `make_att_2d_masks(prefix_pad_masks, prefix_att_masks)` builds the **actual 2D attention mask** `[B, L, L]` that tells the transformer **which token positions may attend to which others**, combining **padding** and **causality**.
|
||||
|
||||
Everything else you’ve seen so far was just metadata.
|
||||
|
||||
---
|
||||
|
||||
## What goes in
|
||||
|
||||
### Inputs
|
||||
|
||||
```python
|
||||
prefix_pad_masks # shape [B, L]
|
||||
prefix_att_masks # shape [B, L]
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `prefix_pad_masks[b, i] = True`
|
||||
→ token `i` exists (not padding)
|
||||
|
||||
* `prefix_att_masks[b, i] = False`
|
||||
→ token `i` is **bidirectional**
|
||||
|
||||
* `prefix_att_masks[b, i] = True`
|
||||
→ token `i` is **causal (autoregressive)**
|
||||
|
||||
---
|
||||
|
||||
## What comes out
|
||||
|
||||
```python
|
||||
att_2d_prefix # shape [B, L, L]
|
||||
```
|
||||
|
||||
Each entry:
|
||||
|
||||
```text
|
||||
att_2d_prefix[b, i, j] = True
|
||||
```
|
||||
|
||||
means:
|
||||
|
||||
> “In batch `b`, **token i (query)** is allowed to attend to **token j (key)**.”
|
||||
|
||||
---
|
||||
|
||||
## How it is constructed (conceptually)
|
||||
|
||||
For **each batch b**, **each query position i**, **each key position j**:
|
||||
|
||||
```python
|
||||
if not (prefix_pad_masks[b, i] and prefix_pad_masks[b, j]):
|
||||
    att[b, i, j] = False # padding neither attends nor is attended to
|
||||
elif not prefix_att_masks[b, i]:
|
||||
    att[b, i, j] = True # bidirectional token → can see all real tokens
|
||||
else:
|
||||
    att[b, i, j] = (j <= i) # causal token → can see only past + itself
|
||||
```
|
||||
|
||||
That’s it.
|
||||
|
||||
---
|
||||
|
||||
## Tiny concrete example (exactly matching your code)
|
||||
|
||||
Suppose:
|
||||
|
||||
```python
|
||||
prefix_pad_masks[0] = [T, T, T, T, T, F]
|
||||
prefix_att_masks[0] = [F, F, F, T, T, T]
|
||||
```
|
||||
|
||||
Tokens:
|
||||
|
||||
```
|
||||
0: IMG
|
||||
1: IMG
|
||||
2: LANG
|
||||
3: SUB0
|
||||
4: SUB1
|
||||
5: PAD
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Resulting `att_2d_prefix[0]`
|
||||
|
||||
`✓ = True, ✗ = False`
|
||||
|
||||
| Q \ K | 0 | 1 | 2 | 3 | 4 | 5 |
|
||||
| ---------- | - | - | - | - | - | - |
|
||||
| 0 (bi) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||||
| 1 (bi) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||||
| 2 (bi) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||||
| 3 (causal) | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
|
||||
| 4 (causal) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
|
||||
| 5 (pad) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
|
||||
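A dependency-free sketch for checking the example by hand. It is an assumption, not the project's code: `make_att_2d_masks_sketch` is a hypothetical name, it handles one sequence instead of a batch of tensors, and it implements the big_vision-style cumulative-sum rule used by upstream pi0 code (query `i` may attend to key `j` when `cumsum(att)[j] <= cumsum(att)[i]` and both tokens are real).

```python
from itertools import accumulate


def make_att_2d_masks_sketch(pad_mask, att_mask):
    """Build an [L, L] boolean attention matrix for ONE sequence.

    Assumed rule: query i sees key j when the running sum of att_mask at j
    is not larger than at i, and neither position is padding.
    """
    block = list(accumulate(int(a) for a in att_mask))
    n = len(pad_mask)
    return [
        [pad_mask[i] and pad_mask[j] and block[j] <= block[i] for j in range(n)]
        for i in range(n)
    ]


T, F = True, False
pad = [T, T, T, T, T, F]
att = [F, F, F, T, T, T]
mask = make_att_2d_masks_sketch(pad, att)
for i, row in enumerate(mask):
    print(i, "".join("1" if allowed else "0" for allowed in row))
```

Under this assumed rule, rows 3 to 5 match the table, but rows 0 to 2 come out as `111000`: the bidirectional prefix does not see the later causal tokens. If the real `make_att_2d_masks` follows the cumulative-sum rule, the first three rows of the table and the "can see all real tokens" rule need the same correction, so it is worth checking the source.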
|
||||
---
|
||||
|
||||
## Why this matters for your training code
|
||||
|
||||
This line:
|
||||
|
||||
```python
|
||||
att_2d_prefix_4d = self._prepare_attention_masks_4d(att_2d_prefix)
|
||||
```
|
||||
|
||||
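Adds a head dimension, `[B, L, L] → [B, 1, L, L]`, and typically converts the boolean mask into an additive one (`True → 0`, `False → a large negative value`) so it can be added to the attention logits.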
|
||||
|
||||
This is **exactly what PaliGemma uses inside self-attention**.
|
||||
|
||||
---
|
||||
|
||||
## Key implications (VERY important)
|
||||
|
||||
### 1️⃣ This mask does **not isolate token groups**
|
||||
|
||||
* Bidirectional tokens can attend to **everything**
|
||||
* Causal tokens only restrict *their own row*
|
||||
|
||||
So **flow/action tokens must be blocked separately**.
|
||||
|
||||
---
|
||||
|
||||
### 2️⃣ This is why your AR subtask prediction works
|
||||
|
||||
* Subtask tokens are causal
|
||||
* Output at position `i` predicts token `i+1`
|
||||
* Padding is fully ignored
|
||||
|
||||
---
|
||||
|
||||
### 3️⃣ Inference behavior
|
||||
|
||||
When `subtask_tokens = None`:
|
||||
|
||||
* `prefix_att_masks` contains only `False`
|
||||
* `att_2d_prefix` becomes **fully bidirectional**
|
||||
* No AR behavior remains
|
||||
|
||||
Exactly what you want.
|
||||
|
||||
---
|
||||
|
||||
## One-sentence takeaway (commit this)
|
||||
|
||||
> `make_att_2d_masks` fuses **padding** and **causality** into a concrete `[B, L, L]` attention matrix that the transformer actually uses.
|
||||
|
||||
@@ -0,0 +1,334 @@
|
||||
Generate annotate_pgen.py using Qwen for synthetic data generation
|
||||
|
||||
You are writing a Python script called annotate_pgen.py.
|
||||
This script generates synthetic user prompts (ℓ_t) and robot utterances (u_t) for Hi Robot–style hierarchical policy training, using Qwen3-VL as the generator model (pgen).
|
||||
|
||||
SCRIPT PURPOSE
|
||||
|
||||
The script must:
|
||||
|
||||
Load Dlabeled, a LeRobot dataset that has been annotated using the annotate.py script, which contains:
|
||||
|
||||
images: list of image paths at time t
|
||||
|
||||
skill_current: the annotated skill label (ℓ̂_t)
|
||||
|
||||
skill_history: list of previous skill labels (ℓ̂₀ … ℓ̂_{t−1}). These were annotated, and their details are stored in the dataset at DATA_PATH/meta/skills.json
|
||||
|
||||
you will find something like
|
||||
|
||||
{
|
||||
"coarse_description": "pink lego brick into the transparent box",
|
||||
"skill_to_task_index": {
|
||||
"robot arm picks up pink lego brick": 19,
|
||||
"robot arm approaches transparent box": 3,
|
||||
"robot arm retracts from transparent box": 28,
|
||||
"robot arm moves towards pink lego brick": 12,
|
||||
"robot arm releases red lego brick into box": 26,
|
||||
"robot arm releases red lego brick into transparent box": 27,
|
||||
"robot arm closes gripper to pick up the pink lego brick": 5,
|
||||
"robot arm lifts the pink lego brick": 7,
|
||||
etc..
|
||||
},
|
||||
"episodes": {
|
||||
"0": {
|
||||
"episode_index": 0,
|
||||
"description": "pink lego brick into the transparent box",
|
||||
"skills": [
|
||||
{
|
||||
"name": "robot arm moves towards pink lego brick",
|
||||
"start": 0.0,
|
||||
"end": 1.8
|
||||
},
|
||||
{
|
||||
"name": "robot arm picks up pink lego brick",
|
||||
"start": 1.8,
|
||||
"end": 3.1
|
||||
},
|
||||
{
|
||||
"name": "robot arm moves towards transparent box",
|
||||
"start": 3.1,
|
||||
"end": 5.5
|
||||
},
|
||||
{
|
||||
"name": "robot arm releases pink lego brick into transparent box",
|
||||
"start": 5.5,
|
||||
"end": 7.0
|
||||
},
|
||||
{
|
||||
"name": "robot arm retracts from transparent box",
|
||||
"start": 7.0,
|
||||
"end": 10.1
|
||||
}
|
||||
]
|
||||
},
|
||||
"1": {
|
||||
"episode_index": 1,
|
||||
"description": "pink lego brick into the transparent box",
|
||||
"skills": [
|
||||
{
|
||||
"name": "robot arm moves towards red lego brick",
|
||||
"start": 0.0,
|
||||
"end": 1.2
|
||||
},
|
||||
{
|
||||
"name": "robot arm picks up red lego brick",
|
||||
"start": 1.2,
|
||||
"end": 2.0
|
||||
},
|
||||
{
|
||||
"name": "robot arm moves towards transparent box",
|
||||
"start": 2.0,
|
||||
"end": 3.8
|
||||
},
|
||||
{
|
||||
"name": "robot arm places red lego brick into transparent box",
|
||||
"start": 3.8,
|
||||
"end": 5.0
|
||||
},
|
||||
{
|
||||
"name": "robot arm moves away from transparent box",
|
||||
"start": 5.0,
|
||||
"end": 8.9
|
||||
}
|
||||
]
|
||||
},
|
||||
|
||||
Note that task_description is a high-level description (e.g., "make a sandwich"), stored in the description field of each episode.
|
||||
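A minimal sketch of how skill_current and skill_history could be derived from the skills.json structure above for a frame at time t. The helper name `skills_at` is hypothetical, and it assumes segments are sorted by start time and treated as half-open intervals [start, end).

```python
def skills_at(skills, t):
    """Split one episode's skill segments into (current skill, history) at time t in seconds."""
    history = []
    for segment in skills:
        if segment["start"] <= t < segment["end"]:
            return segment["name"], history
        if segment["end"] <= t:
            history.append(segment["name"])  # already completed before t
    return None, history  # t lies outside every annotated segment


episode_0_skills = [
    {"name": "robot arm moves towards pink lego brick", "start": 0.0, "end": 1.8},
    {"name": "robot arm picks up pink lego brick", "start": 1.8, "end": 3.1},
    {"name": "robot arm moves towards transparent box", "start": 3.1, "end": 5.5},
]
current, history = skills_at(episode_0_skills, t=2.0)
print(current, history)
```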
|
||||
For each sample, call Qwen VLM to generate:
|
||||
|
||||
synthetic user prompt ℓ_t
|
||||
|
||||
synthetic robot response u_t
|
||||
|
||||
Save results to D_syn in Parquet format inside DATA_PATH/meta/tasks.parquet. Note that tasks.parquet already contains the other tasks, so you need to update it.
|
||||
|
||||
Should be modular, clean, easy to extend, with:
|
||||
|
||||
a PGEN_PROMPT_TEMPLATE
|
||||
|
||||
a construct_prompt() method
|
||||
|
||||
a call_qwen() method
|
||||
|
||||
an annotate_sample() method
|
||||
|
||||
a CLI entrypoint (if __name__ == "__main__":)
|
||||
|
||||
📦 INPUT FORMAT (Dlabeled)
|
||||
|
||||
The script should expect Dlabeled as a .jsonl file where each line has:
|
||||
|
||||
{
|
||||
"episode_id": "ep_001",
|
||||
"t": 37,
|
||||
"images": ["path/to/cam0_t.jpg", "path/to/cam1_t.jpg"],
|
||||
"skill_current": "pick up the KitKat",
|
||||
"skill_history": ["open fridge", "pick up lettuce", "place lettuce"],
|
||||
"task_description": "making a sandwich"
|
||||
}
|
||||
|
||||
📤 OUTPUT FORMAT (D_syn)
|
||||
|
||||
Each line of synthetically generated data should be:
|
||||
|
||||
{
|
||||
"episode_id": "ep_001",
|
||||
"t": 37,
|
||||
"images": ["path/to/cam0_t.jpg", "path/to/cam1_t.jpg"],
|
||||
"skill_current": "pick up the KitKat",
|
||||
"skill_history": [...],
|
||||
"user_prompt": "Can you grab me something sweet?",
|
||||
"robot_utterance": "Sure, I can pick up the KitKat.",
|
||||
"task_description": "making a sandwich"
|
||||
}
|
||||
|
||||
|
||||
Store as syn_annotations.jsonl for debugging.
|
||||
|
||||
🧠 pgen MODEL (Qwen) REQUIREMENTS
|
||||
|
||||
Use HuggingFace Transformers:
|
||||
|
||||
Qwen/Qwen2-VL-7B-Instruct (or any Qwen2-VL Vision-Language model available)
|
||||
|
||||
Use the image + text chat interface
|
||||
|
||||
Vision inputs should be loaded with PIL
|
||||
|
||||
Use a single forward pass that outputs BOTH ℓ_t and u_t in a structured JSON
|
||||
|
||||
📝 PROMPT FORMAT FOR pgen
|
||||
|
||||
Create a template like:
|
||||
|
||||
You are a robot-assistant dialogue generator for hierarchical robot policies.
|
||||
|
||||
You will receive:
|
||||
- A list of images showing the current robot scene.
|
||||
- The high-level task: {task_description}
|
||||
- Previous skill steps completed: {skill_history}
|
||||
- The next skill to be performed by the robot: {skill_current}
|
||||
|
||||
Generate two things in JSON:
|
||||
1. "user_prompt": a natural-sounding user request that logically leads to the robot performing the skill "{skill_current}" given the task and history.
|
||||
2. "robot_utterance": a natural robot reply acknowledging or clarifying the request.
|
||||
|
||||
The responses must be grounded in the visual scene, the task, and the skill history.
|
||||
|
||||
Respond ONLY in JSON:
|
||||
{
|
||||
"user_prompt": "...",
|
||||
"robot_utterance": "..."
|
||||
}
|
||||
|
||||
This response will have a corresponding task_index, and the task will be saved in tasks.parquet. You must update each dataset parquet file, for example /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/data/chunk-000/file-000.parquet, to include this new feature called task_index_high_level. Consider updating the metadata in info.json as well.
|
||||
📌 LOGIC REQUIRED
|
||||
construct_prompt(sample)
|
||||
|
||||
Loads sample dict
|
||||
|
||||
Inserts:
|
||||
|
||||
task_description
|
||||
|
||||
skill_history
|
||||
|
||||
skill_current
|
||||
|
||||
Returns a full text prompt string
|
||||
|
||||
call_qwen(images, prompt)
|
||||
|
||||
Loads images into Qwen-VL multimodal input format
|
||||
|
||||
Calls model.generate
|
||||
|
||||
Parses JSON output
|
||||
|
||||
annotate_sample(sample)
|
||||
|
||||
Builds prompt
|
||||
|
||||
Calls Qwen
|
||||
|
||||
Returns augmented sample with user_prompt + robot_utterance
|
||||
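A minimal sketch of the template and construct_prompt() described above. The template is an abbreviated version of the prompt text in this document, and the "none" fallback for an empty history is an assumption. The literal JSON braces are doubled so that str.format leaves them intact.

```python
PGEN_PROMPT_TEMPLATE = """You are a robot-assistant dialogue generator for hierarchical robot policies.

You will receive:
- A list of images showing the current robot scene.
- The high-level task: {task_description}
- Previous skill steps completed: {skill_history}
- The next skill to be performed by the robot: {skill_current}

Respond ONLY in JSON:
{{
  "user_prompt": "...",
  "robot_utterance": "..."
}}
"""


def construct_prompt(sample):
    """Fill the template from one Dlabeled sample dict."""
    history = ", ".join(sample["skill_history"]) or "none"
    return PGEN_PROMPT_TEMPLATE.format(
        task_description=sample["task_description"],
        skill_history=history,
        skill_current=sample["skill_current"],
    )


sample = {
    "skill_current": "pick up the KitKat",
    "skill_history": ["open fridge", "pick up lettuce", "place lettuce"],
    "task_description": "making a sandwich",
}
prompt = construct_prompt(sample)
print(prompt)
```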
|
||||
🚀 CLI Usage
|
||||
|
||||
The script should run as:
|
||||
|
||||
python annotate_pgen.py \
|
||||
--output-dir PATH \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--repo-id lerobot/svla_so101_pickplace \
|
||||
--batch-size 1
|
||||
|
||||
|
||||
Include arguments via argparse.
|
||||
|
||||
🔧 OTHER REQUIREMENTS
|
||||
|
||||
Use tqdm for progress bars
|
||||
|
||||
Log errors gracefully and continue
|
||||
|
||||
Support GPU acceleration (device="cuda")
|
||||
|
||||
Cache model loading so it's not reloaded every call
|
||||
|
||||
Make the prompt deterministic but allow a temperature parameter
|
||||
|
||||
Add a flag --num-image-views-per-sample
|
||||
|
||||
Add automatic JSON parsing with helpful error messages
|
||||
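One way the JSON parsing requirement could look, as a sketch only: `parse_pgen_output` is a hypothetical helper that assumes the model may wrap its answer in extra text or a Markdown fence, so it extracts the outermost {...} span before decoding and raises a ValueError with a readable message on failure.

```python
import json
import re


def parse_pgen_output(text):
    """Extract the outermost {...} object from model output and validate its keys."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in model output: {text[:80]!r}")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as err:
        raise ValueError(f"Model output is not valid JSON ({err.msg}): {match.group(0)[:80]!r}") from err
    missing = {"user_prompt", "robot_utterance"} - set(data)
    if missing:
        raise ValueError(f"JSON is missing keys: {sorted(missing)}")
    return data


raw = '```json\n{"user_prompt": "Can you grab me something sweet?", "robot_utterance": "Sure, I can pick up the KitKat."}\n```'
parsed = parse_pgen_output(raw)
print(parsed["user_prompt"])
```

Catching ValueError in the caller lets the script log the failing sample and continue, in line with the error-handling requirement above.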
|
||||
🎯 FINAL DELIVERABLE
|
||||
|
||||
Cursor must now generate:
|
||||
A full Python file named annotate_pgen.py implementing the above functionality end-to-end.
|
||||
|
||||
It should be production-ready, runnable on real data, cleanly structured, and easy to modify.
|
||||
|
||||
|
||||
from the paper:
|
||||
Next, we use a large vision-language model (VLM) pgen to produce synthetic user prompts and interjections ℓ_t, and corresponding robot utterance u_t. Given Dlabeled, we prompt pgen with both the visual context I_t^1, ..., I_t^n and the skill label ℓ̂_t (e.g., pick up the lettuce). pgen then imagines an appropriate interaction that might have led to ℓ̂_t in a real user interaction: it generates possible user prompts ℓ_t (e.g., “Can you add some lettuce for me?”) along with the robot’s verbal responses and clarifications u_t. We detail the
|
||||
A. Synthetic Data Generation
|
||||
A.1. Scenario and Response Categorization
|
||||
To ensure the quality and diversity of the synthetic data, we incorporate structured scenario classification and response categorization into the prompt design for pgen, following (Stephan et al., 2024). Specifically, we classify interactions into different scenario types, such as negative task (where the user instructs the robot what not to do), situated correction (where the user adjusts an earlier command based on the evolving task state), and specific constraint (where the user specifies particular constraints, such as dietary preferences). In addition, we categorize the robot’s responses into types such as simple confirmations, clarifications, and error handling. These classifications guide the generation process to ensure a broad range of user-robot interactions.
|
||||
A.2. Prompt Construction for Contextual Grounding
|
||||
In prompt P, we include a detailed description of the task (e.g., bussing a table, making a sandwich, grocery shopping) and instruct the model to ground responses in visual observations and prior context. A key advantage of leveraging large pretrained VLMs is their ability to incorporate world knowledge when generating interactions. For instance, the model can infer dietary constraints when generating prompts for sandwich-making, producing user commands such as “Can you make a sandwich for me? I’m lactose intolerant” and an appropriate robot response like “Sure, I won’t put cheese on it.” Similarly, it can reason over ambiguous or implicit requests, such as inferring that “I want something sweet” in a grocery shopping scenario should lead to suggestions like chocolate or candy.
|
||||
To maintain consistency in multi-step tasks, we condition pgen on prior skill labels within an episode ℓ̂_0, ..., ℓ̂_{t−1}, allowing it to generate coherent user commands that account for past actions. For instance, if the robot has already placed lettuce and tomato on a sandwich, the generated user prompt might request additional ingredients that logically follow. This ensures that the synthetic interactions reflect realistic task progression rather than isolated commands. As such, we leverage pgen(ℓ_t, u_t | I_t^1, ..., I_t^n, ℓ̂_0, ..., ℓ̂_{t−1}, ℓ̂_t, P) to produce a richer, more diverse synthetic dataset Dsyn that provides meaningful supervision for training our high-level policy.
|
||||
While in this work we generate a separate Dsyn and train a separate high-level policy for each task (e.g., sandwich making vs. table cleaning) for clarity and ease of benchmarking, the architecture is readily amenable to a unified multi-task formulation. In principle, the same hierarchical approach could be used to train a single high-level policy across a multitude of tasks, facilitating knowledge transfer
||||
|
||||
|
||||
The result should be a new LeRobotDataset with a new feature called task_index_high_level inside each dataset parquet
|
||||
@@ -0,0 +1,11 @@
|
||||
python examples/dataset/annotate.py \
|
||||
--repo-id jadechoghari/collect-data \
|
||||
--video-key observation.images.base \
|
||||
--model Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--episodes 16 22
|
||||
|
||||
# python examples/dataset/annotate.py \
|
||||
# --repo-id lerobot/svla_so101_pickplace \
|
||||
# --video-key observation.images.side \
|
||||
# --model Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
# --episodes 5
|
||||
@@ -0,0 +1,43 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Example script to run synthetic data generation with Qwen VLM
|
||||
# This generates user prompts and robot utterances for hierarchical policy training
|
||||
|
||||
# Configuration
|
||||
REPO_ID="jadechoghari/collect-data"
|
||||
MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct"
|
||||
# Alternative: MODEL="Qwen/Qwen2-VL-7B-Instruct"
|
||||
|
||||
|
||||
OUTPUT_DIR="/fsx/jade_choghari/outputs/collect-data-pgen"
|
||||
BATCH_SIZE=32
|
||||
TEMPERATURE=0.9
|
||||
SAMPLE_INTERVAL=5.0 # Generate dialogue every 5 seconds (all episodes processed)
|
||||
|
||||
# Run synthetic data generation (processes ALL episodes)
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--repo-id "$REPO_ID" \
|
||||
--model "$MODEL" \
|
||||
--output-dir "$OUTPUT_DIR" \
|
||||
--temperature "$TEMPERATURE" \
|
||||
--batch-size "$BATCH_SIZE" \
|
||||
--sample-interval "$SAMPLE_INTERVAL" \
|
||||
--image-key observation.images.base \
|
||||
--num-image-views-per-sample 1
|
||||
|
||||
# For faster testing, increase sample interval:
|
||||
# --sample-interval 5.0 # Samples every 5 seconds (much faster)
|
||||
|
||||
# To push to hub after generation:
|
||||
# Add --push-to-hub flag
|
||||
|
||||
# Efficient batch processing: 4 episodes at once
|
||||
# python examples/dataset/annotate_pgen.py \
|
||||
# --repo-id "$REPO_ID" \
|
||||
# --model "$MODEL" \
|
||||
# --output-dir "$OUTPUT_DIR" \
|
||||
# --video-mode \
|
||||
# --video-key observation.images.up \
|
||||
# --video-batch-size "$BATCH_SIZE" \
|
||||
# --sample-interval 1.0
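The `--sample-interval` value in the script above is given in seconds. As a rough illustration of why a larger interval runs faster, the sketch below shows one way such an interval could map to frame indices at a given frame rate. `sampled_frame_indices` is a hypothetical helper written for this note, not a function from `annotate_pgen.py`, and the script's actual sampling logic may differ.

```python
def sampled_frame_indices(num_frames: int, fps: float, sample_interval: float) -> list[int]:
    """Return every step-th frame index, where step is roughly fps * sample_interval."""
    if fps <= 0 or sample_interval <= 0:
        raise ValueError("fps and sample_interval must be positive")
    # Convert the interval in seconds to a whole number of frames (at least 1).
    step = max(1, round(fps * sample_interval))
    return list(range(0, num_frames, step))


# A hypothetical 10-second episode recorded at 30 FPS (300 frames).
one_second = sampled_frame_indices(num_frames=300, fps=30, sample_interval=1.0)
five_seconds = sampled_frame_indices(num_frames=300, fps=30, sample_interval=5.0)
```

Under these assumptions a 1-second interval selects ten frames from the episode while a 5-second interval selects two, so the number of VLM calls shrinks in proportion to the interval.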
|
||||
|
||||
@@ -59,7 +59,6 @@ python examples/dataset_annotation/subtask_annotation.py \
|
||||
import argparse
|
||||
import json
|
||||
import multiprocessing as mp
|
||||
import random
|
||||
import re
|
||||
import subprocess
|
||||
import tempfile
|
||||
@@ -67,100 +66,21 @@ import textwrap
|
||||
import time
|
||||
from concurrent.futures import ProcessPoolExecutor, as_completed
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import cv2
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import torch
|
||||
from pydantic import BaseModel, Field
|
||||
from qwen_vl_utils import process_vision_info
|
||||
from rich.console import Console
|
||||
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration
|
||||
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
|
||||
# Pydantic Models for SARM Subtask Annotation
|
||||
class Timestamp(BaseModel):
|
||||
"""Timestamp in MM:SS or SS format"""
|
||||
|
||||
start: str = Field(description="Start timestamp (MM:SS or just seconds)")
|
||||
end: str = Field(description="End timestamp (MM:SS or just seconds)")
|
||||
|
||||
|
||||
class Subtask(BaseModel):
|
||||
"""Individual subtask/stage - must use EXACT names from provided list"""
|
||||
|
||||
name: str = Field(description="Subtask name - MUST match one from the predefined list exactly")
|
||||
timestamps: Timestamp
|
||||
|
||||
|
||||
class SubtaskAnnotation(BaseModel):
|
||||
"""Complete annotation for a robot manipulation episode"""
|
||||
|
||||
subtasks: list[Subtask] = Field(description="List of all subtasks in temporal order")
|
||||
|
||||
|
||||
def compute_temporal_proportions(
|
||||
annotations: dict[int, Any], fps: int = 30, subtask_order: list[str] | None = None
|
||||
) -> dict[str, float]:
|
||||
"""
|
||||
Compute dataset-level temporal proportions (priors) for each subtask.
|
||||
|
||||
Implements SARM Paper Formula (1): ᾱ_k = (1/M) × Σ_i (L_{i,k} / T_i)
|
||||
|
||||
Args:
|
||||
annotations: Dict mapping episode index to SubtaskAnnotation object.
|
||||
fps: Frames per second (unused, kept for API compatibility)
|
||||
subtask_order: Optional list defining the output order of subtasks.
|
||||
|
||||
Returns:
|
||||
Dict mapping subtask name to its temporal proportion (ᾱ_k), ordered by subtask_order if provided.
|
||||
"""
|
||||
subtask_proportions: dict[str, list[float]] = {}
|
||||
|
||||
for annotation in annotations.values():
|
||||
total_duration = 0
|
||||
durations: dict[str, int] = {}
|
||||
|
||||
for subtask in annotation.subtasks:
|
||||
start_parts = subtask.timestamps.start.split(":")
|
||||
end_parts = subtask.timestamps.end.split(":")
|
||||
|
||||
start_seconds = (
|
||||
int(start_parts[0]) * 60 + int(start_parts[1])
|
||||
if len(start_parts) == 2
|
||||
else int(start_parts[0])
|
||||
)
|
||||
end_seconds = (
|
||||
int(end_parts[0]) * 60 + int(end_parts[1]) if len(end_parts) == 2 else int(end_parts[0])
|
||||
)
|
||||
|
||||
duration = end_seconds - start_seconds
|
||||
durations[subtask.name] = duration
|
||||
total_duration += duration
|
||||
|
||||
if total_duration > 0:
|
||||
for name, duration in durations.items():
|
||||
if name not in subtask_proportions:
|
||||
subtask_proportions[name] = []
|
||||
subtask_proportions[name].append(duration / total_duration)
|
||||
|
||||
if not subtask_proportions:
|
||||
return {}
|
||||
|
||||
avg_proportions = {name: sum(props) / len(props) for name, props in subtask_proportions.items()}
|
||||
|
||||
total = sum(avg_proportions.values())
|
||||
if total > 0:
|
||||
avg_proportions = {name: prop / total for name, prop in avg_proportions.items()}
|
||||
|
||||
# Reorder according to subtask_order if provided
|
||||
if subtask_order:
|
||||
avg_proportions = {
|
||||
name: avg_proportions.get(name, 0.0) for name in subtask_order if name in avg_proportions
|
||||
}
|
||||
|
||||
return avg_proportions
|
||||
from lerobot.policies.sarm.sarm_utils import (
|
||||
Subtask,
|
||||
SubtaskAnnotation,
|
||||
Timestamp,
|
||||
compute_temporal_proportions,
|
||||
)
|
||||
|
||||
|
||||
def create_sarm_prompt(subtask_list: list[str]) -> str:
|
||||
@@ -257,8 +177,8 @@ class VideoAnnotator:
|
||||
model_name: str = "Qwen/Qwen3-VL-30B-A3B-Instruct",
|
||||
device: str = "cuda",
|
||||
torch_dtype: torch.dtype = torch.bfloat16,
|
||||
model: Qwen3VLMoeForConditionalGeneration | None = None, # noqa: F821
|
||||
processor: AutoProcessor | None = None, # noqa: F821
|
||||
model: "Qwen3VLMoeForConditionalGeneration | None" = None,
|
||||
processor: "AutoProcessor | None" = None,
|
||||
):
|
||||
"""
|
||||
Initialize the video annotator with local model.
|
||||
@@ -273,17 +193,16 @@ class VideoAnnotator:
|
||||
"""
|
||||
self.subtask_list = subtask_list
|
||||
self.prompt = create_sarm_prompt(subtask_list)
|
||||
self.console = Console()
|
||||
self.device = device
|
||||
|
||||
# Use provided model/processor or load new ones
|
||||
if model is not None and processor is not None:
|
||||
self.model = model
|
||||
self.processor = processor
|
||||
print(f"Using shared model on {device}")
|
||||
self.console.print(f"[green]✓ Using shared model on {device}[/green]")
|
||||
else:
|
||||
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration
|
||||
|
||||
print(f"Loading model: {model_name}...")
|
||||
self.console.print(f"[cyan]Loading model: {model_name}...[/cyan]")
|
||||
|
||||
self.model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
|
||||
model_name, torch_dtype=torch_dtype, device_map=device, trust_remote_code=True
|
||||
@@ -291,7 +210,7 @@ class VideoAnnotator:
|
||||
|
||||
self.processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
|
||||
|
||||
print(f"Model loaded successfully on {device}")
|
||||
self.console.print(f"[green]✓ Model loaded successfully on {device}[/green]")
|
||||
|
||||
def extract_episode_segment(
|
||||
self, file_path: Path, start_timestamp: float, end_timestamp: float, target_fps: int = 1
|
||||
@@ -310,22 +229,25 @@ class VideoAnnotator:
|
||||
Path to extracted video file
|
||||
"""
|
||||
# Create temporary file for extracted video
|
||||
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
|
||||
tmp_path = Path(tmp_file.name)
|
||||
tmp_file = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
|
||||
tmp_path = Path(tmp_file.name)
|
||||
tmp_file.close()
|
||||
|
||||
try:
|
||||
# Check if ffmpeg is available
|
||||
subprocess.run( # nosec B607
|
||||
subprocess.run(
|
||||
["ffmpeg", "-version"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True
|
||||
)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError) as err:
|
||||
raise RuntimeError("ffmpeg not found, cannot extract episode segment") from err
|
||||
except (subprocess.CalledProcessError, FileNotFoundError) as e:
|
||||
raise RuntimeError("ffmpeg not found, cannot extract episode segment") from e
|
||||
|
||||
try:
|
||||
# Calculate duration
|
||||
duration = end_timestamp - start_timestamp
|
||||
|
||||
print(f"Extracting episode: {start_timestamp:.1f}s-{end_timestamp:.1f}s ({duration:.1f}s)")
|
||||
self.console.print(
|
||||
f"[cyan]Extracting episode: {start_timestamp:.1f}s-{end_timestamp:.1f}s ({duration:.1f}s)[/cyan]"
|
||||
)
|
||||
|
||||
# Use ffmpeg to extract segment with minimal quality loss
|
||||
cmd = [
|
||||
@@ -353,7 +275,7 @@ class VideoAnnotator:
|
||||
|
||||
# Verify the output file was created and is not empty
|
||||
if not tmp_path.exists() or tmp_path.stat().st_size == 0:
|
||||
print("Video extraction failed (0 bytes) - skipping episode")
|
||||
self.console.print("[red]✗ Video extraction failed (0 bytes) - skipping episode[/red]")
|
||||
if tmp_path.exists():
|
||||
tmp_path.unlink()
|
||||
raise RuntimeError("FFmpeg produced empty video file")
|
||||
@@ -363,11 +285,13 @@ class VideoAnnotator:
|
||||
|
||||
# Fail if file is too small (< 100KB likely means extraction failed)
|
||||
if file_size_mb < 0.1:
|
||||
print(f"Extracted video too small ({file_size_mb:.2f}MB) - skipping episode")
|
||||
self.console.print(
|
||||
f"[red]✗ Extracted video too small ({file_size_mb:.2f}MB) - skipping episode[/red]"
|
||||
)
|
||||
tmp_path.unlink()
|
||||
raise RuntimeError(f"Video extraction produced invalid file ({file_size_mb:.2f}MB)")
|
||||
|
||||
print(f"Extracted: {file_size_mb:.1f}MB ({target_fps} FPS)")
|
||||
self.console.print(f"[green]✓ Extracted: {file_size_mb:.1f}MB ({target_fps} FPS)[/green]")
|
||||
|
||||
return tmp_path
|
||||
|
||||
@@ -383,8 +307,6 @@ class VideoAnnotator:
|
||||
max_retries: int = 3,
|
||||
) -> SubtaskAnnotation:
|
||||
"""Annotate a video segment using local GPU."""
|
||||
from qwen_vl_utils import process_vision_info
|
||||
|
||||
file_path = Path(file_path)
|
||||
|
||||
if end_timestamp is None:
|
||||
@@ -433,7 +355,7 @@ class VideoAnnotator:
|
||||
)
|
||||
|
||||
response = self.processor.batch_decode(
|
||||
[out[len(inp) :] for inp, out in zip(inputs.input_ids, generated_ids, strict=True)],
|
||||
[out[len(inp) :] for inp, out in zip(inputs.input_ids, generated_ids)],
|
||||
skip_special_tokens=True,
|
||||
)[0].strip()
|
||||
|
||||
@@ -449,7 +371,7 @@ class VideoAnnotator:
|
||||
match = re.search(r"\{.*\}", response, re.DOTALL)
|
||||
if match:
|
||||
return SubtaskAnnotation.model_validate(json.loads(match.group()))
|
||||
raise ValueError("No JSON found") from None
|
||||
raise ValueError("No JSON found")
|
||||
except Exception as e:
|
||||
if attempt == max_retries - 1:
|
||||
raise RuntimeError(f"Failed after {max_retries} attempts") from e
|
||||
@@ -459,12 +381,16 @@ class VideoAnnotator:
|
||||
extracted_path.unlink()
|
||||
|
||||
|
||||
def display_annotation(annotation: SubtaskAnnotation, episode_idx: int, fps: int, prefix: str = ""):
|
||||
def display_annotation(
|
||||
annotation: SubtaskAnnotation, console: Console, episode_idx: int, fps: int, prefix: str = ""
|
||||
):
|
||||
"""Display annotation summary."""
|
||||
subtask_summary = ", ".join(
|
||||
f"{s.name}({s.timestamps.start}-{s.timestamps.end})" for s in annotation.subtasks
|
||||
)
|
||||
print(f"Episode {episode_idx} {prefix}: {len(annotation.subtasks)} subtasks - {subtask_summary}")
|
||||
console.print(
|
||||
f"[green]Episode {episode_idx} {prefix}: {len(annotation.subtasks)} subtasks - {subtask_summary}[/green]"
|
||||
)
|
||||
|
||||
|
||||
def timestamp_to_seconds(timestamp: str) -> float:
|
||||
@@ -476,272 +402,6 @@ def timestamp_to_seconds(timestamp: str) -> float:
|
||||
return int(parts[0])
|
||||
|
||||
|
||||
def extract_frame(video_path: Path, timestamp: float) -> np.ndarray | None:
|
||||
"""Extract a single frame from video at given timestamp."""
|
||||
cap = cv2.VideoCapture(str(video_path))
|
||||
if not cap.isOpened():
|
||||
return None
|
||||
cap.set(cv2.CAP_PROP_POS_MSEC, timestamp * 1000)
|
||||
ret, frame = cap.read()
|
||||
cap.release()
|
||||
return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) if ret else None
|
||||
|
||||
|
||||
def draw_timeline(ax, subtasks, total_duration, colors):
|
||||
"""Draw a timeline with color-coded subtask segments."""
|
||||
import matplotlib.patches as mpatches
|
||||
|
||||
bar_height, bar_y = 0.6, 0.5
|
||||
|
||||
for i, subtask in enumerate(subtasks):
|
||||
start = timestamp_to_seconds(subtask.timestamps.start)
|
||||
end = timestamp_to_seconds(subtask.timestamps.end)
|
||||
color = colors[i % len(colors)]
|
||||
|
||||
rect = mpatches.FancyBboxPatch(
|
||||
(start, bar_y - bar_height / 2),
|
||||
end - start,
|
||||
bar_height,
|
||||
boxstyle="round,pad=0.02,rounding_size=0.1",
|
||||
facecolor=color,
|
||||
edgecolor="white",
|
||||
linewidth=1.5,
|
||||
alpha=0.85,
|
||||
)
|
||||
ax.add_patch(rect)
|
||||
|
||||
# Add label if segment is wide enough
|
||||
duration = end - start
|
||||
if duration > total_duration * 0.06:
|
||||
ax.text(
|
||||
(start + end) / 2,
|
||||
bar_y,
|
||||
subtask.name,
|
||||
ha="center",
|
||||
va="center",
|
||||
fontsize=8,
|
||||
fontweight="bold",
|
||||
color="white",
|
||||
rotation=0 if duration > total_duration * 0.12 else 45,
|
||||
)
|
||||
|
||||
if i > 0:
|
||||
ax.axvline(x=start, ymin=0.1, ymax=0.9, color="white", linestyle="--", linewidth=1.5, alpha=0.7)
|
||||
|
||||
ax.axvline(x=0, ymin=0.1, ymax=0.9, color="#00ff00", linestyle="-", linewidth=2, alpha=0.9)
|
||||
if subtasks:
|
||||
ax.axvline(
|
||||
x=timestamp_to_seconds(subtasks[-1].timestamps.end),
|
||||
ymin=0.1,
|
||||
ymax=0.9,
|
||||
color="white",
|
||||
linestyle="--",
|
||||
linewidth=1.5,
|
||||
alpha=0.7,
|
||||
)
|
||||
|
||||
ax.set_xlim(-total_duration * 0.02, total_duration * 1.02)
|
||||
ax.set_ylim(-0.1, 1.1)
|
||||
ax.set_xlabel("Time (seconds)", fontsize=10, color="white", labelpad=5)
|
||||
for spine in ["top", "right", "left"]:
|
||||
ax.spines[spine].set_visible(False)
|
||||
ax.spines["bottom"].set_color("#444444")
|
||||
ax.tick_params(axis="x", colors="#888888", labelsize=8)
|
||||
ax.tick_params(axis="y", left=False, labelleft=False)
|
||||
|
||||
|
||||
def visualize_episode(
|
||||
ep_idx: int,
|
||||
annotation: SubtaskAnnotation,
|
||||
video_path: Path,
|
||||
video_start: float,
|
||||
video_end: float,
|
||||
output_path: Path,
|
||||
video_key: str,
|
||||
ann_type: str,
|
||||
):
|
||||
"""Create visualization for a single episode with frames and timeline."""
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
if annotation is None:
|
||||
print(f"No {ann_type} annotation for episode {ep_idx}")
|
||||
return
|
||||
|
||||
subtasks = annotation.subtasks
|
||||
if not subtasks:
|
||||
print(f"No subtasks for episode {ep_idx}")
|
||||
return
|
||||
|
||||
colors = plt.cm.tab10(np.linspace(0, 1, max(len(subtasks), 10)))
|
||||
total_duration = timestamp_to_seconds(subtasks[-1].timestamps.end)
|
||||
|
||||
# Extract middle frame from each subtask
|
||||
sample_frames, frame_times = [], []
|
||||
for subtask in subtasks:
|
||||
start = timestamp_to_seconds(subtask.timestamps.start)
|
||||
end = timestamp_to_seconds(subtask.timestamps.end)
|
||||
mid = (start + end) / 2
|
||||
frame_times.append(mid)
|
||||
sample_frames.append(extract_frame(video_path, video_start + mid))
|
||||
|
||||
# Create figure
|
||||
fig_width = max(16, len(subtasks) * 2.5)
|
||||
fig = plt.figure(figsize=(fig_width, 10))
|
||||
fig.patch.set_facecolor("#1a1a2e")
|
||||
|
||||
gs = fig.add_gridspec(
|
||||
2,
|
||||
max(len(subtasks), 1),
|
||||
height_ratios=[2, 1],
|
||||
hspace=0.3,
|
||||
wspace=0.1,
|
||||
left=0.05,
|
||||
right=0.95,
|
||||
top=0.88,
|
||||
bottom=0.1,
|
||||
)
|
||||
|
||||
fig.suptitle(
|
||||
f"Episode {ep_idx} - {ann_type.capitalize()} Annotations",
|
||||
fontsize=18,
|
||||
fontweight="bold",
|
||||
color="white",
|
||||
y=0.96,
|
||||
)
|
||||
fig.text(
|
||||
0.5,
|
||||
0.91,
|
||||
f"Camera: {video_key} | Duration: {video_end - video_start:.1f}s | {len(subtasks)} subtasks",
|
||||
ha="center",
|
||||
fontsize=11,
|
||||
color="#888888",
|
||||
)
|
||||
|
||||
# Plot frames
|
||||
for i, (frame, subtask) in enumerate(zip(sample_frames, subtasks, strict=True)):
|
||||
ax = fig.add_subplot(gs[0, i])
|
||||
ax.set_facecolor("#16213e")
|
||||
if frame is not None:
|
||||
ax.imshow(frame)
|
||||
else:
|
||||
ax.text(
|
||||
0.5, 0.5, "N/A", ha="center", va="center", fontsize=12, color="white", transform=ax.transAxes
|
||||
)
|
||||
ax.set_title(subtask.name, fontsize=10, fontweight="bold", color=colors[i % len(colors)], pad=8)
|
||||
ax.axis("off")
|
||||
ax.text(
|
||||
0.5,
|
||||
-0.08,
|
||||
f"t={frame_times[i]:.1f}s",
|
||||
ha="center",
|
||||
fontsize=9,
|
||||
color="#888888",
|
||||
transform=ax.transAxes,
|
||||
)
|
||||
|
||||
# Plot timeline
|
||||
ax_timeline = fig.add_subplot(gs[1, :])
|
||||
ax_timeline.set_facecolor("#16213e")
|
||||
draw_timeline(ax_timeline, subtasks, total_duration, colors)
|
||||
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
plt.savefig(output_path, dpi=150, facecolor=fig.get_facecolor(), edgecolor="none", bbox_inches="tight")
|
||||
plt.close()
|
||||
print(f"Saved: {output_path}")
|
||||
|
||||
|
||||
def visualize_annotations(
|
||||
dataset: LeRobotDataset,
|
||||
sparse_annotations: dict[int, SubtaskAnnotation],
|
||||
dense_annotations: dict[int, SubtaskAnnotation] | None,
|
||||
video_key: str,
|
||||
output_dir: Path,
|
||||
num_episodes: int = 5,
|
||||
annotation_type: str = "sparse",
|
||||
episode_indices: list[int] | None = None,
|
||||
):
|
||||
"""
|
||||
Visualize subtask annotations for a set of episodes.
|
||||
|
||||
Args:
|
||||
dataset: LeRobotDataset instance
|
||||
sparse_annotations: Dict mapping episode index to sparse annotations
|
||||
dense_annotations: Dict mapping episode index to dense annotations (or None)
|
||||
video_key: Camera/video key to use
|
||||
output_dir: Directory to save visualization images
|
||||
num_episodes: Number of episodes to visualize (ignored if episode_indices provided)
|
||||
annotation_type: "sparse", "dense", or "both"
|
||||
episode_indices: Specific episode indices to visualize (optional)
|
||||
"""
|
||||
# Determine available episodes based on annotation type
|
||||
if annotation_type == "sparse":
|
||||
available = set(sparse_annotations.keys())
|
||||
elif annotation_type == "dense":
|
||||
available = set(dense_annotations.keys()) if dense_annotations else set()
|
||||
else: # both
|
||||
sparse_set = set(sparse_annotations.keys())
|
||||
dense_set = set(dense_annotations.keys()) if dense_annotations else set()
|
||||
available = sparse_set | dense_set
|
||||
|
||||
if not available:
|
||||
print("Error: No annotations found to visualize.")
|
||||
return
|
||||
|
||||
# Select episodes to visualize
|
||||
if episode_indices:
|
||||
episodes = sorted([e for e in episode_indices if e in available])
|
||||
missing = set(episode_indices) - available
|
||||
if missing:
|
||||
print(f"Episodes not found in annotations: {sorted(missing)}")
|
||||
else:
|
||||
episodes = sorted(random.sample(list(available), min(num_episodes, len(available))))
|
||||
print(f"Visualizing {len(episodes)} episodes: {episodes}")
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Generate visualizations
|
||||
for i, ep_idx in enumerate(episodes, 1):
|
||||
print(f"Processing episode {ep_idx} ({i}/{len(episodes)})")
|
||||
video_path = dataset.root / dataset.meta.get_video_file_path(ep_idx, video_key)
|
||||
if not video_path.exists():
|
||||
print(f"Video not found: {video_path}")
|
||||
continue
|
||||
|
||||
video_start = float(dataset.meta.episodes[f"videos/{video_key}/from_timestamp"][ep_idx])
|
||||
video_end = float(dataset.meta.episodes[f"videos/{video_key}/to_timestamp"][ep_idx])
|
||||
|
||||
if annotation_type == "both":
|
||||
# Visualize both sparse and dense
|
||||
for ann_type, annotations in [("sparse", sparse_annotations), ("dense", dense_annotations)]:
|
||||
if annotations and ep_idx in annotations:
|
||||
output_path = output_dir / f"episode_{ep_idx:04d}_{ann_type}.png"
|
||||
visualize_episode(
|
||||
ep_idx,
|
||||
annotations.get(ep_idx),
|
||||
video_path,
|
||||
video_start,
|
||||
video_end,
|
||||
output_path,
|
||||
video_key,
|
||||
ann_type,
|
||||
)
|
||||
else:
|
||||
annotations = sparse_annotations if annotation_type == "sparse" else dense_annotations
|
||||
if annotations and ep_idx in annotations:
|
||||
output_path = output_dir / f"episode_{ep_idx:04d}_{annotation_type}.png"
|
||||
visualize_episode(
|
||||
ep_idx,
|
||||
annotations.get(ep_idx),
|
||||
video_path,
|
||||
video_start,
|
||||
video_end,
|
||||
output_path,
|
||||
video_key,
|
||||
annotation_type,
|
||||
)
|
||||
|
||||
print(f"Visualizations saved to: {output_dir.absolute()}")
|
||||
|
||||
|
||||
def save_annotations_to_dataset(
|
||||
dataset_path: Path, annotations: dict[int, SubtaskAnnotation], fps: int, prefix: str = "sparse"
|
||||
):
|
||||
@@ -873,7 +533,7 @@ def load_annotations_from_dataset(dataset_path: Path, prefix: str = "sparse") ->
|
||||
end=f"{int(e) // 60:02d}:{int(e) % 60:02d}",
|
||||
),
|
||||
)
|
||||
for n, s, e in zip(names, starts, ends, strict=True)
|
||||
for n, s, e in zip(names, starts, ends)
|
||||
]
|
||||
)
|
||||
return annotations
|
||||
@@ -886,6 +546,7 @@ def process_single_episode(
|
||||
video_key: str,
|
||||
fps: int,
|
||||
annotator: VideoAnnotator,
|
||||
console: Console,
|
||||
) -> tuple[int, SubtaskAnnotation | None, str | None]:
|
||||
"""Process a single episode annotation."""
|
||||
try:
|
||||
@@ -913,6 +574,7 @@ def worker_process_episodes(
|
||||
) -> tuple[dict, dict | None]:
|
||||
"""Worker for parallel processing across GPUs."""
|
||||
device = f"cuda:{gpu_id}"
|
||||
console = Console()
|
||||
dataset = LeRobotDataset(repo_id, download_videos=False)
|
||||
|
||||
sparse_annotator = VideoAnnotator(sparse_subtask_list, model_name, device, torch_dtype)
|
||||
@@ -933,14 +595,14 @@ def worker_process_episodes(
|
||||
|
||||
for ep_idx in episode_indices:
|
||||
_, sparse_ann, err = process_single_episode(
|
||||
ep_idx, dataset.root, dataset.meta, video_key, dataset.fps, sparse_annotator
|
||||
ep_idx, dataset.root, dataset.meta, video_key, dataset.fps, sparse_annotator, console
|
||||
)
|
||||
if sparse_ann:
|
||||
sparse_annotations[ep_idx] = sparse_ann
|
||||
|
||||
if dense_annotator:
|
||||
_, dense_ann, _ = process_single_episode(
|
||||
ep_idx, dataset.root, dataset.meta, video_key, dataset.fps, dense_annotator
|
||||
ep_idx, dataset.root, dataset.meta, video_key, dataset.fps, dense_annotator, console
|
||||
)
|
||||
if dense_ann:
|
||||
dense_annotations[ep_idx] = dense_ann
|
||||
@@ -970,75 +632,15 @@ def main():
|
||||
parser.add_argument("--dtype", type=str, default="bfloat16", choices=["bfloat16", "float16", "float32"])
|
||||
parser.add_argument("--num-workers", type=int, default=1, help="Parallel workers for multi-GPU")
|
||||
parser.add_argument("--gpu-ids", type=int, nargs="+", default=None, help="GPU IDs to use")
|
||||
# Visualization options
|
||||
parser.add_argument(
|
||||
"--visualize-only",
|
||||
action="store_true",
|
||||
help="Only visualize existing annotations (no generation)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num-visualizations",
|
||||
type=int,
|
||||
default=5,
|
||||
help="Number of episodes to visualize (default: 5)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--visualize-type",
|
||||
type=str,
|
||||
default="sparse",
|
||||
choices=["sparse", "dense", "both"],
|
||||
help="Type of annotations to visualize (default: sparse)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output-dir",
|
||||
type=str,
|
||||
default="./subtask_viz",
|
||||
help="Output directory for visualizations (default: ./subtask_viz)",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
console = Console()
|
||||
|
||||
# Load dataset first (needed for both annotation and visualization)
|
||||
print(f"Loading dataset: {args.repo_id}")
|
||||
dataset = LeRobotDataset(args.repo_id, download_videos=True)
|
||||
fps = dataset.fps
|
||||
|
||||
if not dataset.meta.video_keys:
|
||||
raise ValueError("No video keys found")
|
||||
|
||||
video_key = (
|
||||
args.video_key if args.video_key in (dataset.meta.video_keys or []) else dataset.meta.video_keys[0]
|
||||
)
|
||||
print(f"Using camera: {video_key}, FPS: {fps}")
|
||||
|
||||
# Handle visualization-only mode
|
||||
if args.visualize_only:
|
||||
print("Visualization-only mode")
|
||||
sparse_annotations = load_annotations_from_dataset(dataset.root, prefix="sparse")
|
||||
dense_annotations = load_annotations_from_dataset(dataset.root, prefix="dense")
|
||||
|
||||
if not sparse_annotations and not dense_annotations:
|
||||
return print("Error: No annotations found. Run annotation first.")
|
||||
|
||||
print(f"Found {len(sparse_annotations)} sparse, {len(dense_annotations)} dense annotations")
|
||||
|
||||
visualize_annotations(
|
||||
dataset=dataset,
|
||||
sparse_annotations=sparse_annotations,
|
||||
dense_annotations=dense_annotations if dense_annotations else None,
|
||||
video_key=video_key,
|
||||
output_dir=Path(args.output_dir),
|
||||
num_episodes=args.num_visualizations,
|
||||
annotation_type=args.visualize_type,
|
||||
episode_indices=args.episodes,
|
||||
)
|
||||
return
|
||||
|
||||
# Validate arguments for annotation mode
|
||||
# Validate arguments
|
||||
if args.dense_only and not args.dense_subtasks:
|
||||
return print("Error: --dense-only requires --dense-subtasks")
|
||||
return console.print("[red]Error: --dense-only requires --dense-subtasks[/red]")
|
||||
if args.dense_subtasks and not args.sparse_subtasks and not args.dense_only:
|
||||
return print("Error: --dense-subtasks requires --sparse-subtasks or --dense-only")
|
||||
return console.print("[red]Error: --dense-subtasks requires --sparse-subtasks or --dense-only[/red]")
|
||||
|
||||
sparse_subtask_list = (
|
||||
[s.strip() for s in args.sparse_subtasks.split(",")] if args.sparse_subtasks else None
|
||||
@@ -1048,6 +650,18 @@ def main():
|
||||
dense_mode = dense_subtask_list is not None
|
||||
torch_dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}[args.dtype]
|
||||
|
||||
console.print(f"[cyan]Loading dataset: {args.repo_id}[/cyan]")
|
||||
dataset = LeRobotDataset(args.repo_id, download_videos=True)
|
||||
fps = dataset.fps
|
||||
|
||||
if not dataset.meta.video_keys:
|
||||
raise ValueError("No video keys found")
|
||||
|
||||
video_key = (
|
||||
args.video_key if args.video_key in (dataset.meta.video_keys or []) else dataset.meta.video_keys[0]
|
||||
)
|
||||
console.print(f"[cyan]Using camera: {video_key}, FPS: {fps}[/cyan]")
|
||||
|
||||
# Determine episodes
|
||||
episode_indices = args.episodes or list(range(dataset.meta.total_episodes))
|
||||
|
||||
@@ -1056,8 +670,8 @@ def main():
|
||||
episode_indices = [ep for ep in episode_indices if ep not in existing_annotations]
|
||||
|
||||
if not episode_indices:
|
||||
return print("All episodes already annotated!")
|
||||
print(f"Annotating {len(episode_indices)} episodes")
|
||||
return console.print("[green]All episodes already annotated![/green]")
|
||||
console.print(f"[cyan]Annotating {len(episode_indices)} episodes[/cyan]")
|
||||
|
||||
# GPU setup
|
||||
gpu_ids = args.gpu_ids or list(
|
||||
@@ -1072,7 +686,7 @@ def main():
|
||||
if auto_sparse:
|
||||
sparse_annotations.update(generate_auto_sparse_annotations(dataset, episode_indices, video_key))
|
||||
save_annotations_to_dataset(dataset.root, sparse_annotations, fps, prefix="sparse")
|
||||
print(f"Auto-generated {len(episode_indices)} sparse 'task' annotations")
|
||||
console.print(f"[green]Auto-generated {len(episode_indices)} sparse 'task' annotations[/green]")
|
||||
|
||||
# VLM annotation (for sparse if not auto, and for dense)
|
||||
need_vlm = (not auto_sparse) or dense_mode
|
||||
@@ -1080,7 +694,7 @@ def main():
|
||||
if need_vlm:
|
||||
if args.num_workers > 1 and not auto_sparse:
|
||||
# Parallel processing
|
||||
print(f"Parallel processing with {args.num_workers} workers")
|
||||
console.print(f"[cyan]Parallel processing with {args.num_workers} workers[/cyan]")
|
||||
episodes_per_worker = [[] for _ in range(args.num_workers)]
|
||||
for i, ep_idx in enumerate(episode_indices):
|
||||
episodes_per_worker[i % args.num_workers].append(ep_idx)
|
||||
@@ -1137,66 +751,52 @@ def main():
|
||||
)
|
||||
|
||||
for i, ep_idx in enumerate(episode_indices):
|
||||
print(f"Episode {ep_idx} ({i + 1}/{len(episode_indices)})")
|
||||
console.print(f"[cyan]Episode {ep_idx} ({i + 1}/{len(episode_indices)})[/cyan]")
|
||||
|
||||
if sparse_annotator:
|
||||
_, sparse_ann, err = process_single_episode(
|
||||
ep_idx, dataset.root, dataset.meta, video_key, fps, sparse_annotator
|
||||
ep_idx, dataset.root, dataset.meta, video_key, fps, sparse_annotator, console
|
||||
)
|
||||
if sparse_ann:
|
||||
sparse_annotations[ep_idx] = sparse_ann
|
||||
save_annotations_to_dataset(dataset.root, sparse_annotations, fps, prefix="sparse")
|
||||
elif err:
|
||||
print(f"Sparse failed: {err}")
|
||||
console.print(f"[red]Sparse failed: {err}[/red]")
|
||||
|
||||
if dense_annotator:
|
||||
_, dense_ann, err = process_single_episode(
|
||||
ep_idx, dataset.root, dataset.meta, video_key, fps, dense_annotator
|
||||
ep_idx, dataset.root, dataset.meta, video_key, fps, dense_annotator, console
|
||||
)
|
||||
if dense_ann:
|
||||
dense_annotations[ep_idx] = dense_ann
|
||||
save_annotations_to_dataset(dataset.root, dense_annotations, fps, prefix="dense")
|
||||
elif err:
|
||||
print(f"Dense failed: {err}")
|
||||
console.print(f"[red]Dense failed: {err}[/red]")
|
||||
|
||||
# Save temporal proportions
|
||||
def save_proportions(annotations, prefix, subtask_list=None, is_auto=False):
|
||||
props: dict[str, float] = (
|
||||
{"task": 1.0} if is_auto else compute_temporal_proportions(annotations, fps, subtask_list)
|
||||
)
|
||||
def save_proportions(annotations, prefix, is_auto=False):
|
||||
props: dict[str, float] = {"task": 1.0} if is_auto else compute_temporal_proportions(annotations, fps)
|
||||
path = dataset.root / "meta" / f"temporal_proportions_{prefix}.json"
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(path, "w") as f:
|
||||
json.dump(props, f, indent=2)
|
||||
print(f"Saved {prefix} temporal proportions")
|
||||
console.print(f"[green]Saved {prefix} temporal proportions[/green]")
|
||||
|
||||
save_proportions(sparse_annotations, "sparse", sparse_subtask_list, auto_sparse)
|
||||
save_proportions(sparse_annotations, "sparse", auto_sparse)
|
||||
if dense_mode and dense_annotations:
|
||||
save_proportions(dense_annotations, "dense", dense_subtask_list)
|
||||
save_proportions(dense_annotations, "dense")
|
||||
|
||||
print(f"\nComplete! {len(sparse_annotations)} sparse, {len(dense_annotations or {})} dense annotations")
|
||||
|
||||
# Visualize annotations after generation
|
||||
if args.num_visualizations > 0:
|
||||
print(f"\nGenerating {args.num_visualizations} visualizations...")
|
||||
visualize_type = "both" if dense_mode else "sparse"
|
||||
visualize_annotations(
|
||||
dataset=dataset,
|
||||
sparse_annotations=sparse_annotations,
|
||||
dense_annotations=dense_annotations,
|
||||
video_key=video_key,
|
||||
output_dir=Path(args.output_dir),
|
||||
num_episodes=args.num_visualizations,
|
||||
annotation_type=visualize_type,
|
||||
)
|
||||
console.print(
|
||||
f"\n[bold green]Complete! {len(sparse_annotations)} sparse, {len(dense_annotations or {})} dense annotations[/bold green]"
|
||||
)
|
||||
|
||||
if args.push_to_hub:
|
||||
try:
|
||||
dataset.push_to_hub(push_videos=True)
|
||||
print(f"Pushed to {args.output_repo_id or args.repo_id}")
|
||||
console.print(f"[green]Pushed to {args.output_repo_id or args.repo_id}[/green]")
|
||||
except Exception as e:
|
||||
print(f"Push failed: {e}")
|
||||
console.print(f"[red]Push failed: {e}[/red]")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
main()
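As a worked illustration of the temporal-proportion prior described in the `compute_temporal_proportions` docstring in this diff (formula (1), ᾱ_k = (1/M) × Σ_i (L_{i,k} / T_i)), here is a minimal standalone sketch. It follows the same steps as the function shown in the diff (parse `MM:SS` or plain-seconds timestamps, compute each subtask's share of its episode, average across episodes, renormalize), but it uses plain tuples instead of the `SubtaskAnnotation` models and assumes each subtask name appears at most once per episode. The names `to_seconds` and `temporal_proportions` and the sample episodes are made up for this example.

```python
from collections import defaultdict


def to_seconds(ts: str) -> int:
    """Parse 'MM:SS' or plain 'SS' into whole seconds."""
    parts = ts.split(":")
    if len(parts) == 2:
        return int(parts[0]) * 60 + int(parts[1])
    return int(parts[0])


def temporal_proportions(episodes):
    """episodes: one list of (name, start, end) tuples per episode."""
    per_subtask = defaultdict(list)
    for subtasks in episodes:
        durations = {name: to_seconds(end) - to_seconds(start) for name, start, end in subtasks}
        total = sum(durations.values())
        if total > 0:
            for name, duration in durations.items():
                per_subtask[name].append(duration / total)
    # Average each subtask's share over the episodes in which it appears.
    averaged = {name: sum(shares) / len(shares) for name, shares in per_subtask.items()}
    norm = sum(averaged.values())
    # Renormalize so the proportions sum to 1.
    return {name: p / norm for name, p in averaged.items()} if norm > 0 else averaged


# Two made-up episodes of 8 seconds each, using both timestamp formats.
episodes = [
    [("reach", "00:00", "00:02"), ("grasp", "00:02", "00:04"), ("place", "00:04", "00:08")],
    [("reach", "0", "4"), ("grasp", "4", "6"), ("place", "6", "8")],
]
proportions = temporal_proportions(episodes)
```

In this example "reach" takes 25% of the first episode and 50% of the second, so its prior is 0.375; "grasp" is 0.25 in both, and "place" averages to 0.375.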
|
||||
@@ -0,0 +1 @@
|
||||
srun --time 12:00:00 --qos=high --gres=gpu:1 --mem=24G --partition=hopper-prod --container-image /fsx/michel_aractingi/docker_images/huggingface+lerobot-gpu+dev.sqsh --container-mounts /fsx/jade_choghari
|
||||
@@ -0,0 +1,44 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Quick test to verify the fix for task_indices length mismatch
|
||||
# This should now work correctly even with --num-samples < full dataset length
|
||||
|
||||
echo "Testing annotate_pgen.py with --num-samples=100 on full dataset..."
|
||||
|
||||
python examples/dataset/annotate_pgen.py \
|
||||
--data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
|
||||
--model Qwen/Qwen3-VL-30B-A3B-Instruct \
|
||||
--num-samples 100 \
|
||||
--sample-interval 1.0 \
|
||||
--output-dir /fsx/jade_choghari/outputs/pgen_test_fixed
|
||||
|
||||
if [ $? -eq 0 ]; then
|
||||
echo "✓ SUCCESS: Script completed without errors!"
|
||||
echo ""
|
||||
echo "Verifying output..."
|
||||
|
||||
# Check that all frames have task_index_high_level
|
||||
python -c "
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
import numpy as np
|
||||
|
||||
ds = LeRobotDataset(repo_id='local_test', root='/fsx/jade_choghari/outputs/pgen_test_fixed')
|
||||
print(f'Dataset has {len(ds)} frames')
|
||||
print(f'Features: {list(ds.features.keys())}')
|
||||
|
||||
# Check that task_index_high_level exists
|
||||
assert 'task_index_high_level' in ds.features, 'task_index_high_level not in features!'
|
||||
|
||||
# Sample some frames
|
||||
for idx in [0, 50, 99, 100, 500, 1000, 11938]:
|
||||
if idx < len(ds):
|
||||
frame = ds[idx]
|
||||
task_idx = frame['task_index_high_level'].item()
|
||||
print(f'Frame {idx}: task_index_high_level = {task_idx}')
|
||||
|
||||
print('✓ All checks passed!')
|
||||
"
|
||||
else
|
||||
echo "✗ FAILED: Script exited with a non-zero status"
|
||||
fi
|
||||
|
||||
@@ -21,7 +21,7 @@ from lerobot.robots.lekiwi.config_lekiwi import LeKiwiClientConfig
|
||||
from lerobot.robots.lekiwi.lekiwi_client import LeKiwiClient
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.teleoperators.keyboard import KeyboardTeleop, KeyboardTeleopConfig
|
||||
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.utils.constants import ACTION, OBS_STR
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
|
||||
@@ -18,7 +18,7 @@ import time
|
||||
|
||||
from lerobot.robots.lekiwi import LeKiwiClient, LeKiwiClientConfig
|
||||
from lerobot.teleoperators.keyboard.teleop_keyboard import KeyboardTeleop, KeyboardTeleopConfig
|
||||
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data
|
||||
|
||||
|
||||
@@ -34,11 +34,12 @@ from lerobot.processor.converters import (
|
||||
transition_to_observation,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
ForwardKinematicsJointsToEE,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
|
||||
@@ -26,14 +26,15 @@ from lerobot.processor.converters import (
|
||||
transition_to_observation,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
EEBoundsAndSafety,
|
||||
EEReferenceAndDelta,
|
||||
ForwardKinematicsJointsToEE,
|
||||
GripperVelocityToJoint,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.teleoperators.phone.config_phone import PhoneConfig, PhoneOS
|
||||
from lerobot.teleoperators.phone.phone_processor import MapPhoneActionToRobotAction
|
||||
|
||||
@@ -23,10 +23,11 @@ from lerobot.processor.converters import (
|
||||
robot_action_observation_to_transition,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.utils.constants import ACTION
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.utils import log_say
|
||||
@@ -95,7 +96,7 @@ def main():
|
||||
# Send action to robot
|
||||
_ = robot.send_action(joint_action)
|
||||
|
||||
precise_sleep(max(1.0 / dataset.fps - (time.perf_counter() - t0), 0.0))
|
||||
precise_sleep(1.0 / dataset.fps - (time.perf_counter() - t0))
|
||||
|
||||
# Clean up
|
||||
robot.disconnect()
|
||||
|
||||
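Note on the `precise_sleep` change in the hunk above: one side of the diff clamps the remaining time with `max(..., 0.0)` and the other passes the raw difference, which goes negative whenever an iteration overruns its frame period. The following is a minimal sketch of the clamped pattern using only the standard library. `remaining_sleep` and `run_fixed_rate` are hypothetical helper names, not part of LeRobot, and `time.sleep` stands in for `precise_sleep`, whose handling of negative values is not shown in this diff.

```python
import time


def remaining_sleep(period_s: float, elapsed_s: float) -> float:
    """Time left in the current control period, clamped so it is never negative."""
    return max(period_s - elapsed_s, 0.0)


def run_fixed_rate(step, fps: float, num_steps: int) -> None:
    """Call `step()` `num_steps` times, aiming for `fps` iterations per second."""
    period_s = 1.0 / fps
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step()
        # Without the clamp, an overrunning step would yield a negative value,
        # and time.sleep() raises ValueError for negative arguments.
        time.sleep(remaining_sleep(period_s, time.perf_counter() - t0))
```

When a step takes longer than one period, the loop simply skips the sleep and starts the next iteration immediately, so the rate degrades gracefully instead of failing.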
@@ -21,13 +21,14 @@ from lerobot.processor.converters import (
|
||||
robot_action_observation_to_transition,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
EEBoundsAndSafety,
|
||||
EEReferenceAndDelta,
|
||||
GripperVelocityToJoint,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.teleoperators.phone.config_phone import PhoneConfig, PhoneOS
|
||||
from lerobot.teleoperators.phone.phone_processor import MapPhoneActionToRobotAction
|
||||
from lerobot.teleoperators.phone.teleop_phone import Phone
|
||||
|
||||
@@ -94,9 +94,9 @@ from lerobot.rl.process import ProcessSignalHandler
|
||||
from lerobot.robots import ( # noqa: F401
|
||||
Robot,
|
||||
RobotConfig,
|
||||
bi_so_follower,
|
||||
koch_follower,
|
||||
so_follower,
|
||||
so100_follower,
|
||||
so101_follower,
|
||||
)
|
||||
from lerobot.robots.utils import make_robot_from_config
|
||||
from lerobot.utils.constants import OBS_IMAGES
|
||||
@@ -455,18 +455,7 @@ def demo_cli(cfg: RTCDemoConfig):
|
||||
if cfg.policy.type == "pi05" or cfg.policy.type == "pi0":
|
||||
config.compile_model = cfg.use_torch_compile
|
||||
|
||||
if config.use_peft:
|
||||
from peft import PeftConfig, PeftModel
|
||||
|
||||
peft_pretrained_path = cfg.policy.pretrained_path
|
||||
peft_config = PeftConfig.from_pretrained(peft_pretrained_path)
|
||||
|
||||
policy = policy_class.from_pretrained(
|
||||
pretrained_name_or_path=peft_config.base_model_name_or_path, config=config
|
||||
)
|
||||
policy = PeftModel.from_pretrained(policy, peft_pretrained_path, config=peft_config)
|
||||
else:
|
||||
policy = policy_class.from_pretrained(cfg.policy.pretrained_path, config=config)
|
||||
policy = policy_class.from_pretrained(cfg.policy.pretrained_path, config=config)
|
||||
|
||||
# Turn on RTC
|
||||
policy.config.rtc_config = cfg.rtc
|
||||
|
||||
@@ -34,11 +34,12 @@ from lerobot.processor.converters import (
|
||||
transition_to_observation,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
ForwardKinematicsJointsToEE,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
|
||||
@@ -27,14 +27,16 @@ from lerobot.processor.converters import (
|
||||
transition_to_observation,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
EEBoundsAndSafety,
|
||||
ForwardKinematicsJointsToEE,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.scripts.lerobot_record import record_loop
|
||||
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader.config_so100_leader import SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader.so100_leader import SO100Leader
|
||||
from lerobot.utils.control_utils import init_keyboard_listener
|
||||
from lerobot.utils.utils import log_say
|
||||
from lerobot.utils.visualization_utils import init_rerun
|
||||
|
||||
@@ -24,10 +24,11 @@ from lerobot.processor.converters import (
|
||||
robot_action_observation_to_transition,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.utils.constants import ACTION
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.utils import log_say
|
||||
@@ -96,7 +97,7 @@ def main():
|
||||
# Send action to robot
|
||||
_ = robot.send_action(joint_action)
|
||||
|
||||
precise_sleep(max(1.0 / dataset.fps - (time.perf_counter() - t0), 0.0))
|
||||
precise_sleep(1.0 / dataset.fps - (time.perf_counter() - t0))
|
||||
|
||||
# Clean up
|
||||
robot.disconnect()
|
||||
|
||||
@@ -23,13 +23,15 @@ from lerobot.processor.converters import (
|
||||
robot_action_to_transition,
|
||||
transition_to_robot_action,
|
||||
)
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so_follower.robot_kinematic_processor import (
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.robot_kinematic_processor import (
|
||||
EEBoundsAndSafety,
|
||||
ForwardKinematicsJointsToEE,
|
||||
InverseKinematicsEEToJoints,
|
||||
)
|
||||
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
from lerobot.teleoperators.so100_leader.config_so100_leader import SO100LeaderConfig
|
||||
from lerobot.teleoperators.so100_leader.so100_leader import SO100Leader
|
||||
from lerobot.utils.robot_utils import precise_sleep
|
||||
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data
|
||||
|
||||
|
||||
@@ -5,7 +5,8 @@ from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
|
||||
from lerobot.policies.act.modeling_act import ACTPolicy
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.policies.utils import build_inference_frame, make_robot_action
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
|
||||
MAX_EPISODES = 5
|
||||
MAX_STEPS_PER_EPISODE = 20
|
||||
|
||||
@@ -4,7 +4,7 @@ from lerobot.async_inference.configs import RobotClientConfig
|
||||
from lerobot.async_inference.helpers import visualize_action_queue_size
|
||||
from lerobot.async_inference.robot_client import RobotClient
|
||||
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
|
||||
from lerobot.robots.so_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower import SO100FollowerConfig
|
||||
|
||||
|
||||
def main():
|
||||
|
||||
@@ -5,7 +5,8 @@ from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
|
||||
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.policies.utils import build_inference_frame, make_robot_action
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
|
||||
MAX_EPISODES = 5
|
||||
MAX_STEPS_PER_EPISODE = 20
|
||||
|
||||
@@ -5,7 +5,8 @@ from lerobot.datasets.utils import hw_to_dataset_features
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.policies.pi0.modeling_pi0 import PI0Policy
|
||||
from lerobot.policies.utils import build_inference_frame, make_robot_action
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
|
||||
MAX_EPISODES = 5
|
||||
MAX_STEPS_PER_EPISODE = 20
|
||||
|
||||
@@ -14,8 +14,8 @@ from lerobot.policies.sac.modeling_sac import SACPolicy
|
||||
from lerobot.policies.sac.reward_model.modeling_classifier import Classifier
|
||||
from lerobot.rl.buffer import ReplayBuffer
|
||||
from lerobot.rl.gym_manipulator import make_robot_env
|
||||
from lerobot.robots.so_follower import SO100FollowerConfig
|
||||
from lerobot.teleoperators.so_leader import SO100LeaderConfig
|
||||
from lerobot.robots.so100_follower import SO100FollowerConfig
|
||||
from lerobot.teleoperators.so100_leader import SO100LeaderConfig
|
||||
from lerobot.teleoperators.utils import TeleopEvents
|
||||
|
||||
LOG_EVERY = 10
|
||||
|
||||
@@ -5,7 +5,8 @@ from lerobot.datasets.utils import hw_to_dataset_features
|
||||
from lerobot.policies.factory import make_pre_post_processors
|
||||
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
|
||||
from lerobot.policies.utils import build_inference_frame, make_robot_action
|
||||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.config_so100_follower import SO100FollowerConfig
|
||||
from lerobot.robots.so100_follower.so100_follower import SO100Follower
|
||||
|
||||
MAX_EPISODES = 5
|
||||
MAX_STEPS_PER_EPISODE = 20
|
||||
|
||||
@@ -13,9 +13,16 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Example: GR00T Locomotion with Pre-loaded Policies
|
||||
|
||||
This example demonstrates the NEW pattern for loading GR00T policies externally
|
||||
and passing them to the robot class.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import threading
|
||||
import time
|
||||
from collections import deque
|
||||
|
||||
@@ -24,26 +31,24 @@ import onnxruntime as ort
|
||||
from huggingface_hub import hf_hub_download
|
||||
|
||||
from lerobot.robots.unitree_g1.config_unitree_g1 import UnitreeG1Config
|
||||
from lerobot.robots.unitree_g1.g1_utils import G1_29_JointIndex
|
||||
from lerobot.robots.unitree_g1.unitree_g1 import UnitreeG1
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
GROOT_DEFAULT_ANGLES = np.zeros(29, dtype=np.float32)
|
||||
GROOT_DEFAULT_ANGLES[[0, 6]] = -0.1 # Hip pitch
|
||||
GROOT_DEFAULT_ANGLES[[3, 9]] = 0.3 # Knee
|
||||
GROOT_DEFAULT_ANGLES[[4, 10]] = -0.2 # Ankle pitch
|
||||
GROOT_DEFAULT_ANGLES[[0, 6]] = -0.1 # hip pitch
|
||||
GROOT_DEFAULT_ANGLES[[3, 9]] = 0.3 # knee
|
||||
GROOT_DEFAULT_ANGLES[[4, 10]] = -0.2 # ankle pitch
|
||||
|
||||
MISSING_JOINTS = []
|
||||
G1_MODEL = "g1_23" # Or "g1_29"
|
||||
G1_MODEL = "g1_23" # or "g1_29"
|
||||
if G1_MODEL == "g1_23":
|
||||
MISSING_JOINTS = [12, 14, 20, 21, 27, 28] # Waist yaw/pitch, wrist pitch/yaw
|
||||
MISSING_JOINTS = [12, 14, 20, 21, 27, 28] # waist yaw/pitch, wrist pitch/yaw
|
||||
|
||||
LOCOMOTION_ACTION_SCALE = 0.25
|
||||
|
||||
LOCOMOTION_CONTROL_DT = 0.02
|
||||
|
||||
# Control parameters
|
||||
ACTION_SCALE = 0.25
|
||||
CONTROL_DT = 0.02 # 50Hz
|
||||
ANG_VEL_SCALE: float = 0.25
|
||||
DOF_POS_SCALE: float = 1.0
|
||||
DOF_VEL_SCALE: float = 0.05
|
||||
@@ -56,12 +61,12 @@ DEFAULT_GROOT_REPO_ID = "nepyope/GR00T-WholeBodyControl_g1"
|
||||
def load_groot_policies(
|
||||
repo_id: str = DEFAULT_GROOT_REPO_ID,
|
||||
) -> tuple[ort.InferenceSession, ort.InferenceSession]:
|
||||
"""Load GR00T dual-policy system (Balance + Walk) from the hub.
|
||||
"""Load GR00T dual-policy system (Balance + Walk) from Hugging Face Hub.
|
||||
|
||||
Args:
|
||||
repo_id: Hugging Face Hub repository ID containing the ONNX policies.
|
||||
"""
|
||||
logger.info(f"Loading GR00T dual-policy system from the hub ({repo_id})...")
|
||||
logger.info(f"Loading GR00T dual-policy system from Hugging Face Hub ({repo_id})...")
|
||||
|
||||
# Download ONNX policies from Hugging Face Hub
|
||||
balance_path = hf_hub_download(
|
||||
@@ -83,7 +88,15 @@ def load_groot_policies(
|
||||
|
||||
|
||||
class GrootLocomotionController:
|
||||
"""GR00T lower-body locomotion controller for the Unitree G1."""
|
||||
"""
|
||||
Handles GR00T-style locomotion control for the Unitree G1 robot.
|
||||
|
||||
This controller manages:
|
||||
- Dual-policy system (Balance + Walk)
|
||||
- 29-joint observation processing
|
||||
- 15D action output (legs + waist)
|
||||
- Policy inference and motor command generation
|
||||
"""
|
||||
|
||||
def __init__(self, policy_balance, policy_walk, robot, config):
|
||||
self.policy_balance = policy_balance
|
||||
@@ -91,9 +104,9 @@ class GrootLocomotionController:
|
||||
self.robot = robot
|
||||
self.config = config
|
||||
|
||||
self.cmd = np.array([0.0, 0.0, 0.0], dtype=np.float32) # vx, vy, theta_dot
|
||||
self.locomotion_cmd = np.array([0.0, 0.0, 0.0], dtype=np.float32) # vx, vy, theta_dot
|
||||
|
||||
# Robot state
|
||||
# GR00T-specific state
|
||||
self.groot_qj_all = np.zeros(29, dtype=np.float32)
|
||||
self.groot_dqj_all = np.zeros(29, dtype=np.float32)
|
||||
self.groot_action = np.zeros(15, dtype=np.float32)
|
||||
@@ -103,39 +116,47 @@ class GrootLocomotionController:
|
||||
self.groot_height_cmd = 0.74 # Default base height
|
||||
self.groot_orientation_cmd = np.array([0.0, 0.0, 0.0], dtype=np.float32)
|
||||
|
||||
# Input to GR00T is 6 frames (6*86D=516)
|
||||
# input to gr00t is 6 frames (6*86D=516)
|
||||
for _ in range(6):
|
||||
self.groot_obs_history.append(np.zeros(86, dtype=np.float32))
|
||||
|
||||
# Thread management
|
||||
self.locomotion_running = False
|
||||
self.locomotion_thread = None
|
||||
|
||||
logger.info("GrootLocomotionController initialized")
|
||||
|
||||
def run_step(self):
|
||||
# Get current observation
|
||||
obs = self.robot.get_observation()
|
||||
def groot_locomotion_run(self):
|
||||
# get current observation
|
||||
robot_state = self.robot.get_observation()
|
||||
|
||||
if not obs:
|
||||
if robot_state is None:
|
||||
return
|
||||
|
||||
# Get command from remote controller
|
||||
if obs["remote.buttons"][0]: # R1 - raise waist
|
||||
self.groot_height_cmd += 0.001
|
||||
self.groot_height_cmd = np.clip(self.groot_height_cmd, 0.50, 1.00)
|
||||
if obs["remote.buttons"][4]: # R2 - lower waist
|
||||
self.groot_height_cmd -= 0.001
|
||||
self.groot_height_cmd = np.clip(self.groot_height_cmd, 0.50, 1.00)
|
||||
# get command from remote controller
|
||||
if robot_state.wireless_remote is not None:
|
||||
self.robot.remote_controller.set(robot_state.wireless_remote)
|
||||
if self.robot.remote_controller.button[0]: # R1 - raise waist
|
||||
self.groot_height_cmd += 0.001
|
||||
self.groot_height_cmd = np.clip(self.groot_height_cmd, 0.50, 1.00)
|
||||
if self.robot.remote_controller.button[4]: # R2 - lower waist
|
||||
self.groot_height_cmd -= 0.001
|
||||
self.groot_height_cmd = np.clip(self.groot_height_cmd, 0.50, 1.00)
|
||||
else:
|
||||
self.robot.remote_controller.lx = 0.0
|
||||
self.robot.remote_controller.ly = 0.0
|
||||
self.robot.remote_controller.rx = 0.0
|
||||
self.robot.remote_controller.ry = 0.0
|
||||
|
||||
self.cmd[0] = obs["remote.ly"] # Forward/backward
|
||||
self.cmd[1] = obs["remote.lx"] * -1 # Left/right
|
||||
self.cmd[2] = obs["remote.rx"] * -1 # Rotation rate
|
||||
self.locomotion_cmd[0] = self.robot.remote_controller.ly # forward/backward
|
||||
self.locomotion_cmd[1] = self.robot.remote_controller.lx * -1 # left/right
|
||||
self.locomotion_cmd[2] = self.robot.remote_controller.rx * -1 # rotation rate
|
||||
|
||||
# Get joint positions and velocities from flat dict
|
||||
for motor in G1_29_JointIndex:
|
||||
name = motor.name
|
||||
idx = motor.value
|
||||
self.groot_qj_all[idx] = obs[f"{name}.q"]
|
||||
self.groot_dqj_all[idx] = obs[f"{name}.dq"]
|
||||
for i in range(29):
|
||||
self.groot_qj_all[i] = robot_state.motor_state[i].q
|
||||
self.groot_dqj_all[i] = robot_state.motor_state[i].dq
|
||||
|
||||
# Adapt observation for g1_23dof
|
||||
# adapt observation for g1_23dof
|
||||
for idx in MISSING_JOINTS:
|
||||
self.groot_qj_all[idx] = 0.0
|
||||
self.groot_dqj_all[idx] = 0.0
|
||||
@@ -144,18 +165,18 @@ class GrootLocomotionController:
|
||||
qj_obs = self.groot_qj_all.copy()
|
||||
dqj_obs = self.groot_dqj_all.copy()
|
||||
|
||||
# Express IMU data in gravity frame of reference
|
||||
quat = [obs["imu.quat.w"], obs["imu.quat.x"], obs["imu.quat.y"], obs["imu.quat.z"]]
|
||||
ang_vel = np.array([obs["imu.gyro.x"], obs["imu.gyro.y"], obs["imu.gyro.z"]], dtype=np.float32)
|
||||
# express imu data in gravity frame of reference
|
||||
quat = robot_state.imu_state.quaternion
|
||||
ang_vel = np.array(robot_state.imu_state.gyroscope, dtype=np.float32)
|
||||
gravity_orientation = self.robot.get_gravity_orientation(quat)
|
||||
|
||||
# Scale joint positions and velocities before policy inference
|
||||
# scale joint positions and velocities before policy inference
|
||||
qj_obs = (qj_obs - GROOT_DEFAULT_ANGLES) * DOF_POS_SCALE
|
||||
dqj_obs = dqj_obs * DOF_VEL_SCALE
|
||||
ang_vel_scaled = ang_vel * ANG_VEL_SCALE
|
||||
|
||||
# Build single frame observation
|
||||
self.groot_obs_single[:3] = self.cmd * np.array(CMD_SCALE)
|
||||
# build single frame observation
|
||||
self.groot_obs_single[:3] = self.locomotion_cmd * np.array(CMD_SCALE)
|
||||
self.groot_obs_single[3] = self.groot_height_cmd
|
||||
self.groot_obs_single[4:7] = self.groot_orientation_cmd
|
||||
self.groot_obs_single[7:10] = ang_vel_scaled
|
||||
@@ -173,76 +194,113 @@ class GrootLocomotionController:
|
||||
end_idx = start_idx + 86
|
||||
self.groot_obs_stacked[start_idx:end_idx] = obs_frame
|
||||
|
||||
cmd_magnitude = np.linalg.norm(self.cmd)
|
||||
# Run policy inference (ONNX) with 516D stacked observation
|
||||
|
||||
cmd_magnitude = np.linalg.norm(self.locomotion_cmd)
|
||||
|
||||
selected_policy = (
|
||||
self.policy_balance if cmd_magnitude < 0.05 else self.policy_walk
|
||||
) # Balance/standing policy for small commands, walking policy for movement commands
|
||||
) # balance/standing policy for small commands, walking policy for movement commands
|
||||
|
||||
# Run policy inference
|
||||
# run policy inference
|
||||
ort_inputs = {selected_policy.get_inputs()[0].name: np.expand_dims(self.groot_obs_stacked, axis=0)}
|
||||
ort_outs = selected_policy.run(None, ort_inputs)
|
||||
self.groot_action = ort_outs[0].squeeze()
|
||||
|
||||
# Transform action back to target joint positions
|
||||
target_dof_pos_15 = GROOT_DEFAULT_ANGLES[:15] + self.groot_action * ACTION_SCALE
|
||||
# transform action back to target joint positions
|
||||
target_dof_pos_15 = GROOT_DEFAULT_ANGLES[:15] + self.groot_action * LOCOMOTION_ACTION_SCALE
|
||||
|
||||
# Build action dict (only first 15 joints for GR00T)
|
||||
action_dict = {}
|
||||
# command motors
|
||||
for i in range(15):
|
||||
motor_name = G1_29_JointIndex(i).name
|
||||
action_dict[f"{motor_name}.q"] = float(target_dof_pos_15[i])
|
||||
motor_idx = i
|
||||
self.robot.msg.motor_cmd[motor_idx].q = target_dof_pos_15[i]
|
||||
self.robot.msg.motor_cmd[motor_idx].qd = 0
|
||||
self.robot.msg.motor_cmd[motor_idx].kp = self.robot.kp[motor_idx]
|
||||
self.robot.msg.motor_cmd[motor_idx].kd = self.robot.kd[motor_idx]
|
||||
self.robot.msg.motor_cmd[motor_idx].tau = 0
|
||||
|
||||
# Zero out missing joints for g1_23dof
|
||||
# adapt action for g1_23dof
|
||||
for joint_idx in MISSING_JOINTS:
|
||||
motor_name = G1_29_JointIndex(joint_idx).name
|
||||
action_dict[f"{motor_name}.q"] = 0.0
|
||||
self.robot.msg.motor_cmd[joint_idx].q = 0.0
|
||||
self.robot.msg.motor_cmd[joint_idx].qd = 0
|
||||
self.robot.msg.motor_cmd[joint_idx].kp = self.robot.kp[joint_idx]
|
||||
self.robot.msg.motor_cmd[joint_idx].kd = self.robot.kd[joint_idx]
|
||||
self.robot.msg.motor_cmd[joint_idx].tau = 0
|
||||
|
||||
# Send action to robot
|
||||
self.robot.send_action(action_dict)
|
||||
# send action to robot
|
||||
self.robot.send_action(self.robot.msg)
|
||||
|
||||
|
||||
def run(repo_id: str = DEFAULT_GROOT_REPO_ID) -> None:
|
||||
"""Main function to run the GR00T locomotion controller.
|
||||
|
||||
Args:
|
||||
repo_id: Hugging Face Hub repository ID for GR00T policies.
|
||||
"""
|
||||
# Load policies
|
||||
policy_balance, policy_walk = load_groot_policies(repo_id=repo_id)
|
||||
|
||||
# Initialize robot
|
||||
config = UnitreeG1Config()
|
||||
robot = UnitreeG1(config)
|
||||
|
||||
robot.connect()
|
||||
|
||||
# Initialize GR00T locomotion controller
|
||||
groot_controller = GrootLocomotionController(
|
||||
policy_balance=policy_balance,
|
||||
policy_walk=policy_walk,
|
||||
robot=robot,
|
||||
config=config,
|
||||
)
|
||||
|
||||
try:
|
||||
robot.reset(CONTROL_DT, GROOT_DEFAULT_ANGLES)
|
||||
|
||||
logger.info("Use joystick: LY=fwd/back, LX=left/right, RX=rotate, R1=raise waist, R2=lower waist")
|
||||
logger.info("Press Ctrl+C to stop")
|
||||
|
||||
# Run step
|
||||
while not robot._shutdown_event.is_set():
|
||||
def _locomotion_thread_loop(self):
|
||||
"""Background thread that runs the locomotion policy at specified rate."""
|
||||
logger.info("Locomotion thread started")
|
||||
while self.locomotion_running:
|
||||
start_time = time.time()
|
||||
groot_controller.run_step()
|
||||
try:
|
||||
self.groot_locomotion_run()
|
||||
except Exception as e:
|
||||
logger.error(f"Error in locomotion loop: {e}")
|
||||
|
||||
# Sleep to maintain control rate
|
||||
elapsed = time.time() - start_time
|
||||
sleep_time = max(0, CONTROL_DT - elapsed)
|
||||
sleep_time = max(0, LOCOMOTION_CONTROL_DT - elapsed)
|
||||
time.sleep(sleep_time)
|
||||
except KeyboardInterrupt:
|
||||
logger.info("Stopping locomotion...")
|
||||
finally:
|
||||
if robot.is_connected:
|
||||
robot.disconnect()
|
||||
logger.info("Done!")
|
||||
logger.info("Locomotion thread stopped")
|
||||
|
||||
def start_locomotion_thread(self):
|
||||
if self.locomotion_running:
|
||||
logger.warning("Locomotion thread already running")
|
||||
return
|
||||
|
||||
logger.info("Starting locomotion control thread...")
|
||||
self.locomotion_running = True
|
||||
self.locomotion_thread = threading.Thread(target=self._locomotion_thread_loop, daemon=True)
|
||||
self.locomotion_thread.start()
|
||||
|
||||
logger.info("Locomotion control thread started!")
|
||||
|
||||
def stop_locomotion_thread(self):
|
||||
if not self.locomotion_running:
|
||||
return
|
||||
|
||||
logger.info("Stopping locomotion control thread...")
|
||||
self.locomotion_running = False
|
||||
if self.locomotion_thread:
|
||||
self.locomotion_thread.join(timeout=2.0)
|
||||
logger.info("Locomotion control thread stopped")
|
||||
|
||||
def reset_robot(self):
|
||||
"""Move robot legs to default standing position over 3 seconds (arms are not moved)."""
|
||||
total_time = 3.0
|
||||
num_step = int(total_time / self.robot.control_dt)
|
||||
|
||||
# Only control legs, not arms (first 12 joints)
|
||||
default_pos = GROOT_DEFAULT_ANGLES # First 12 values are leg angles
|
||||
dof_size = len(default_pos)
|
||||
|
||||
# Get current lowstate
|
||||
robot_state = self.robot.get_observation()
|
||||
|
||||
# Record the current leg positions
|
||||
init_dof_pos = np.zeros(dof_size, dtype=np.float32)
|
||||
for i in range(dof_size):
|
||||
init_dof_pos[i] = robot_state.motor_state[i].q
|
||||
|
||||
# Move legs to default pos
|
||||
for i in range(num_step):
|
||||
alpha = i / num_step
|
||||
for motor_idx in range(dof_size):
|
||||
target_pos = default_pos[motor_idx]
|
||||
self.robot.msg.motor_cmd[motor_idx].q = (
|
||||
init_dof_pos[motor_idx] * (1 - alpha) + target_pos * alpha
|
||||
)
|
||||
self.robot.msg.motor_cmd[motor_idx].qd = 0
|
||||
self.robot.msg.motor_cmd[motor_idx].kp = self.robot.kp[motor_idx]
|
||||
self.robot.msg.motor_cmd[motor_idx].kd = self.robot.kd[motor_idx]
|
||||
self.robot.msg.motor_cmd[motor_idx].tau = 0
|
||||
self.robot.msg.crc = self.robot.crc.Crc(self.robot.msg)
|
||||
self.robot.lowcmd_publisher.Write(self.robot.msg)
|
||||
time.sleep(self.robot.control_dt)
|
||||
logger.info("Reached default position (legs only)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
@@ -255,4 +313,35 @@ if __name__ == "__main__":
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
run(repo_id=args.repo_id)
|
||||
# load policies
|
||||
policy_balance, policy_walk = load_groot_policies(repo_id=args.repo_id)
|
||||
|
||||
# initialize robot
|
||||
config = UnitreeG1Config()
|
||||
robot = UnitreeG1(config)
|
||||
|
||||
# initialize gr00t locomotion controller
|
||||
groot_controller = GrootLocomotionController(
|
||||
policy_balance=policy_balance,
|
||||
policy_walk=policy_walk,
|
||||
robot=robot,
|
||||
config=config,
|
||||
)
|
||||
|
||||
# reset legs and start locomotion thread
|
||||
try:
|
||||
groot_controller.reset_robot()
|
||||
groot_controller.start_locomotion_thread()
|
||||
|
||||
# log status
|
||||
logger.info("Robot initialized with GR00T locomotion policies")
|
||||
logger.info("Locomotion controller running in background thread")
|
||||
logger.info("Press Ctrl+C to stop")
|
||||
|
||||
# keep robot alive
|
||||
while True:
|
||||
time.sleep(1.0)
|
||||
except KeyboardInterrupt:
|
||||
print("\nStopping locomotion...")
|
||||
groot_controller.stop_locomotion_thread()
|
||||
print("Done!")
|
||||
|
||||
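Note on the controller in the file above: it keeps a history of six 86-dimensional observation frames, feeds the 516-dimensional stacked vector to an ONNX policy, and picks the balance policy when the command norm is below 0.05 and the walk policy otherwise. The sketch below illustrates only that bookkeeping with NumPy. The helper names, the `maxlen` on the deque, and the oldest-first stacking order are assumptions made for illustration; the diff does not show how `groot_obs_history` is constructed or ordered.

```python
from collections import deque

import numpy as np

FRAME_DIM = 86            # per-frame observation size (from the diff)
HISTORY_LEN = 6           # number of stacked frames (from the diff)
BALANCE_THRESHOLD = 0.05  # command-norm cutoff (from the diff)


def make_history() -> deque:
    """Create a history pre-filled with zero frames (maxlen is an assumption)."""
    history = deque(maxlen=HISTORY_LEN)
    for _ in range(HISTORY_LEN):
        history.append(np.zeros(FRAME_DIM, dtype=np.float32))
    return history


def push_and_stack(history: deque, frame: np.ndarray) -> np.ndarray:
    """Append the newest frame and return the stacked vector, oldest frame first."""
    history.append(np.asarray(frame, dtype=np.float32).copy())
    return np.concatenate(list(history))


def select_policy_name(cmd: np.ndarray) -> str:
    """Return "balance" for near-zero commands and "walk" otherwise."""
    return "balance" if np.linalg.norm(cmd) < BALANCE_THRESHOLD else "walk"
```

In the real controller, the stacked vector would be given a leading batch dimension (as the diff does with `np.expand_dims(..., axis=0)`) before being passed to the selected `ort.InferenceSession`.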
@@ -1,264 +0,0 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
|
||||
import numpy as np
|
||||
import onnx
|
||||
import onnxruntime as ort
|
||||
from huggingface_hub import hf_hub_download
|
||||
|
||||
from lerobot.robots.unitree_g1.config_unitree_g1 import UnitreeG1Config
|
||||
from lerobot.robots.unitree_g1.g1_utils import G1_29_JointIndex
|
||||
from lerobot.robots.unitree_g1.unitree_g1 import UnitreeG1
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DEFAULT_ANGLES = np.zeros(29, dtype=np.float32)
|
||||
DEFAULT_ANGLES[[0, 6]] = -0.312 # Hip pitch
|
||||
DEFAULT_ANGLES[[3, 9]] = 0.669 # Knee
|
||||
DEFAULT_ANGLES[[4, 10]] = -0.363 # Ankle pitch
|
||||
DEFAULT_ANGLES[[15, 22]] = 0.2 # Shoulder pitch
|
||||
DEFAULT_ANGLES[16] = 0.2 # Left shoulder roll
|
||||
DEFAULT_ANGLES[23] = -0.2 # Right shoulder roll
|
||||
DEFAULT_ANGLES[[18, 25]] = 0.6 # Elbow
|
||||
|
||||
MISSING_JOINTS = []
|
||||
G1_MODEL = "g1_23" # Or "g1_29"
|
||||
if G1_MODEL == "g1_23":
|
||||
MISSING_JOINTS = [12, 14, 20, 21, 27, 28] # Waist yaw/pitch, wrist pitch/yaw
|
||||
|
||||
# Control parameters
|
||||
ACTION_SCALE = 0.25
|
||||
CONTROL_DT = 0.02 # 50Hz
|
||||
ANG_VEL_SCALE = 0.25
|
||||
DOF_POS_SCALE = 1.0
|
||||
DOF_VEL_SCALE = 0.05
|
||||
GAIT_PERIOD = 1.0
|
||||
|
||||
|
||||
DEFAULT_HOLOSOMA_REPO_ID = "nepyope/holosoma_locomotion"
|
||||
|
||||
# Policy filename mapping
|
||||
POLICY_FILES = {
|
||||
"fastsac": "fastsac_g1_29dof.onnx",
|
||||
"ppo": "ppo_g1_29dof.onnx",
|
||||
}
|
||||
|
||||
|
||||
def load_policy(
|
||||
repo_id: str = DEFAULT_HOLOSOMA_REPO_ID,
|
||||
policy_type: str = "fastsac",
|
||||
) -> tuple[ort.InferenceSession, np.ndarray, np.ndarray]:
|
||||
"""Load Holosoma locomotion policy and extract KP/KD from metadata.
|
||||
|
||||
Args:
|
||||
repo_id: Hugging Face Hub repo ID
|
||||
policy_type: Either "fastsac" (default) or "ppo"
|
||||
|
||||
Returns:
|
||||
(policy, kp, kd) tuple
|
||||
"""
|
||||
if policy_type not in POLICY_FILES:
|
||||
raise ValueError(f"Unknown policy type: {policy_type}. Choose from: {list(POLICY_FILES.keys())}")
|
||||
|
||||
filename = POLICY_FILES[policy_type]
|
||||
logger.info(f"Loading {policy_type.upper()} policy from: {repo_id}/{filename}")
|
||||
policy_path = hf_hub_download(repo_id=repo_id, filename=filename)
|
||||
|
||||
policy = ort.InferenceSession(policy_path)
|
||||
logger.info(f"Policy loaded: {policy.get_inputs()[0].shape} → {policy.get_outputs()[0].shape}")
|
||||
|
||||
# Extract KP/KD from ONNX metadata
|
||||
model = onnx.load(policy_path)
|
||||
metadata = {prop.key: prop.value for prop in model.metadata_props}
|
||||
|
||||
if "kp" not in metadata or "kd" not in metadata:
|
||||
raise ValueError("ONNX model must contain 'kp' and 'kd' in metadata")
|
||||
|
||||
kp = np.array(json.loads(metadata["kp"]), dtype=np.float32)
|
||||
kd = np.array(json.loads(metadata["kd"]), dtype=np.float32)
|
||||
logger.info(f"Loaded KP/KD from ONNX ({len(kp)} joints)")
|
||||
|
||||
return policy, kp, kd
|
||||
|
||||
|
||||
class HolosomaLocomotionController:
|
||||
"""Holosoma whole-body locomotion controller for Unitree G1."""
|
||||
|
||||
def __init__(self, policy, robot, kp: np.ndarray, kd: np.ndarray):
|
||||
self.policy = policy
|
||||
self.robot = robot
|
||||
|
||||
# Override robot's PD gains with policy gains
|
||||
self.robot.kp = kp
|
||||
self.robot.kd = kd
|
||||
|
||||
self.cmd = np.zeros(3, dtype=np.float32)
|
||||
|
||||
# Robot state
|
||||
self.qj = np.zeros(29, dtype=np.float32)
|
||||
self.dqj = np.zeros(29, dtype=np.float32)
|
||||
self.obs = np.zeros(100, dtype=np.float32)
|
||||
self.last_action = np.zeros(29, dtype=np.float32)
|
||||
|
||||
# Gait phase
|
||||
self.phase = np.array([[0.0, np.pi]], dtype=np.float32)
|
||||
self.phase_dt = 2 * np.pi / ((1.0 / CONTROL_DT) * GAIT_PERIOD)
|
||||
self.is_standing = True
|
||||
|
||||
def run_step(self):
|
||||
# Get current observation
|
||||
obs = self.robot.get_observation()
|
||||
|
||||
if not obs:
|
||||
return
|
||||
|
||||
# Get command from remote controller
|
||||
ly = obs["remote.ly"] if abs(obs["remote.ly"]) > 0.1 else 0.0
|
||||
lx = obs["remote.lx"] if abs(obs["remote.lx"]) > 0.1 else 0.0
|
||||
rx = obs["remote.rx"] if abs(obs["remote.rx"]) > 0.1 else 0.0
|
||||
self.cmd[:] = [ly, -lx, -rx]
|
||||
|
||||
# Get joint positions and velocities
|
||||
for motor in G1_29_JointIndex:
|
||||
name = motor.name
|
||||
idx = motor.value
|
||||
self.qj[idx] = obs[f"{name}.q"]
|
||||
self.dqj[idx] = obs[f"{name}.dq"]
|
||||
|
||||
# Adapt observation for g1_23dof
|
||||
for idx in MISSING_JOINTS:
|
||||
self.qj[idx] = 0.0
|
||||
self.dqj[idx] = 0.0
|
||||
|
||||
# Express IMU data in gravity frame of reference
|
||||
quat = [obs["imu.quat.w"], obs["imu.quat.x"], obs["imu.quat.y"], obs["imu.quat.z"]]
|
||||
ang_vel = np.array([obs["imu.gyro.x"], obs["imu.gyro.y"], obs["imu.gyro.z"]], dtype=np.float32)
|
||||
gravity = self.robot.get_gravity_orientation(quat)
|
||||
|
||||
# Scale joint positions and velocities before policy inference
|
||||
qj_obs = (self.qj - DEFAULT_ANGLES) * DOF_POS_SCALE
|
||||
dqj_obs = self.dqj * DOF_VEL_SCALE
|
||||
ang_vel_s = ang_vel * ANG_VEL_SCALE
|
||||
|
||||
# Update gait phase
|
||||
if np.linalg.norm(self.cmd[:2]) < 0.01 and abs(self.cmd[2]) < 0.01:
|
||||
self.phase[0, :] = np.pi
|
||||
self.is_standing = True
|
||||
elif self.is_standing:
|
||||
self.phase = np.array([[0.0, np.pi]], dtype=np.float32)
|
||||
self.is_standing = False
|
||||
else:
|
||||
self.phase = np.fmod(self.phase + self.phase_dt + np.pi, 2 * np.pi) - np.pi
|
||||
|
||||
sin_ph = np.sin(self.phase[0])
|
||||
cos_ph = np.cos(self.phase[0])
|
||||
|
||||
# Build observations
|
||||
self.obs[0:29] = self.last_action
|
||||
self.obs[29:32] = ang_vel_s
|
||||
self.obs[32] = self.cmd[2]
|
||||
self.obs[33:35] = self.cmd[:2]
|
||||
self.obs[35:37] = cos_ph
|
||||
self.obs[37:66] = qj_obs
|
||||
self.obs[66:95] = dqj_obs
|
||||
self.obs[95:98] = gravity
|
||||
self.obs[98:100] = sin_ph
|
||||
|
||||
# Run policy inference
|
||||
ort_in = {self.policy.get_inputs()[0].name: self.obs.reshape(1, -1).astype(np.float32)}
|
||||
raw_action = self.policy.run(None, ort_in)[0].squeeze()
|
||||
action = np.clip(raw_action, -100.0, 100.0)
|
||||
self.last_action = action.copy()
|
||||
|
||||
# Transform action back to target joint positions
|
||||
target = DEFAULT_ANGLES + action * ACTION_SCALE
|
||||
|
||||
# Build action dict
|
||||
action_dict = {}
|
||||
for motor in G1_29_JointIndex:
|
||||
action_dict[f"{motor.name}.q"] = float(target[motor.value])
|
||||
|
||||
# Zero out missing joints for g1_23dof
|
||||
for joint_idx in MISSING_JOINTS:
|
||||
motor_name = G1_29_JointIndex(joint_idx).name
|
||||
action_dict[f"{motor_name}.q"] = 0.0
|
||||
|
||||
# Send action to robot
|
||||
self.robot.send_action(action_dict)
|
||||
|
||||
|
||||
def run(repo_id: str = DEFAULT_HOLOSOMA_REPO_ID, policy_type: str = "fastsac") -> None:
|
||||
"""Main function to run the Holosoma locomotion controller.
|
||||
|
||||
Args:
|
||||
repo_id: Hugging Face Hub repository ID for Holosoma policies.
|
||||
policy_type: Policy type to use ('fastsac' or 'ppo').
|
||||
"""
|
||||
# Load policy and gains
|
||||
policy, kp, kd = load_policy(repo_id=repo_id, policy_type=policy_type)
|
||||
|
||||
# Initialize robot
|
||||
config = UnitreeG1Config()
|
||||
robot = UnitreeG1(config)
|
||||
robot.connect()
|
||||
|
||||
holosoma_controller = HolosomaLocomotionController(policy, robot, kp, kd)
|
||||
|
||||
try:
|
||||
robot.reset(CONTROL_DT, DEFAULT_ANGLES)
|
||||
|
||||
logger.info("Use joystick: LY=fwd/back, LX=left/right, RX=rotate")
|
||||
logger.info("Press Ctrl+C to stop")
|
||||
|
||||
# Run step
|
||||
while not robot._shutdown_event.is_set():
|
||||
start_time = time.time()
|
||||
holosoma_controller.run_step()
|
||||
elapsed = time.time() - start_time
|
||||
sleep_time = max(0, CONTROL_DT - elapsed)
|
||||
time.sleep(sleep_time)
|
||||
except KeyboardInterrupt:
|
||||
logger.info("Stopping locomotion...")
|
||||
finally:
|
||||
if robot.is_connected:
|
||||
robot.disconnect()
|
||||
logger.info("Done!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Holosoma Locomotion Controller for Unitree G1")
|
||||
parser.add_argument(
|
||||
"--repo-id",
|
||||
type=str,
|
||||
default=DEFAULT_HOLOSOMA_REPO_ID,
|
||||
help=f"Hugging Face Hub repo ID for Holosoma policies (default: {DEFAULT_HOLOSOMA_REPO_ID})",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--policy",
|
||||
type=str,
|
||||
choices=["fastsac", "ppo"],
|
||||
default="fastsac",
|
||||
help="Policy type to use: 'fastsac' (default) or 'ppo'",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
run(repo_id=args.repo_id, policy_type=args.policy)
|
||||
@@ -0,0 +1,47 @@
|
||||
# Voice Assistant Examples
|
||||
|
||||
Voice-enabled robot assistant examples using speech-to-text (STT) and text-to-speech (TTS).
|
||||
|
||||
## Overview
|
||||
|
||||
These examples demonstrate how to build a voice interface for robot control:
|
||||
|
||||
1. **Hold SPACE** → Push-to-talk recording starts
|
||||
2. **Release SPACE** → Recording stops
|
||||
3. **STT (Whisper)** → Converts speech to text (high-level task prompt)
|
||||
4. **Pi0.5** → Generates robot response/utterance
|
||||
5. **TTS (Kokoro)** → Speaks the response back
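
The five steps above amount to one conversational turn. A minimal sketch of that control flow, assuming the three stages are passed in as plain callables (`handle_turn`, `transcribe`, `generate`, and `speak` are hypothetical names used for illustration, not the example script's API):

```python
from typing import Callable, Optional, Sequence


def handle_turn(
    audio: Optional[Sequence[float]],
    transcribe: Callable[[Sequence[float]], str],
    generate: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one push-to-talk turn; return the spoken response, or None if skipped."""
    if audio is None:  # nothing usable was recorded
        return None
    text = transcribe(audio).strip()
    if not text:  # STT produced no text
        return None
    response = generate(text)  # stands in for the Pi0.5 generation step
    speak(response)  # stands in for the TTS step
    return response
```

Injecting the stages keeps the turn logic testable without a microphone, model weights, or a speaker.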
|
||||
|
||||
## Requirements
|
||||
|
||||
```bash
|
||||
pip install torch transformers sounddevice numpy pynput "kokoro>=0.9.2"
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### With Pi0.5 Model
|
||||
|
||||
```bash
|
||||
python examples/voice_assistant/voice_assistant_pi05.py \
|
||||
--pretrained_path path/to/pi05/checkpoint
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
### Pi0.5 Voice Integration
|
||||
|
||||
Pi0.5 can generate robot utterances as part of its subtask prediction. The flow:
|
||||
|
||||
1. **High-level prompt**: User voice command is transcribed and formatted as a task prompt
|
||||
2. **Subtask generation**: Pi0.5 autoregressively generates a response
|
||||
3. **Utterance extraction**: If the response contains `<utterance>...</utterance>` tags, the content is extracted
|
||||
4. **TTS output**: The extracted utterance (or the full response, if no tags are found) is spoken back to the user
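
Step 3 can be sketched with a non-greedy regular expression. This is a minimal illustration, assuming the tags appear literally in the decoded text; the function name `extract_utterance` is chosen here for illustration:

```python
import re

# Non-greedy match so only the first tagged span is captured;
# DOTALL lets the utterance span multiple lines.
UTTERANCE_PATTERN = re.compile(r"<utterance>(.*?)</utterance>", re.DOTALL)


def extract_utterance(text: str) -> str:
    """Return the text inside <utterance> tags, or the input unchanged if absent."""
    match = UTTERANCE_PATTERN.search(text)
    return match.group(1).strip() if match else text
```

Text without tags is returned unchanged, so the TTS stage always has something to speak.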
|
||||
|
||||
## Configuration Options
|
||||
|
||||
| Option | Default | Description |
|
||||
|--------|---------|-------------|
|
||||
| `--pretrained_path` | None | Path to Pi0.5 checkpoint |
|
||||
| `--max_record_seconds` | 30.0 | Maximum recording duration in seconds |
|
||||
| `--max_response_tokens` | 100 | Max tokens in generated response |
|
||||
@@ -0,0 +1,336 @@
|
||||
#!/usr/bin/env python
|
||||
"""
|
||||
Voice Assistant with Pi0.5: Microphone → STT → Pi0.5 → TTS → Speaker
|
||||
|
||||
This example demonstrates how to use Pi0.5 as a conversational robot assistant:
|
||||
1. Hold SPACE to record your voice command
|
||||
2. Speech-to-text (Whisper) converts speech to text
|
||||
3. Text is fed as a high-level prompt to Pi0.5
|
||||
4. Pi0.5 generates a response (robot utterance)
|
||||
5. Text-to-speech (Kokoro) speaks the response back
|
||||
|
||||
Requirements:
|
||||
pip install torch transformers sounddevice numpy pynput "kokoro>=0.9.2"
|
||||
|
||||
Usage:
|
||||
python examples/voice_assistant/voice_assistant_pi05.py \
|
||||
--pretrained_path path/to/pi05/checkpoint
|
||||
"""
|
||||
|
||||
import os
|
||||
|
||||
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
||||
|
||||
import argparse
|
||||
import re
|
||||
import subprocess
|
||||
import threading
|
||||
import time
|
||||
|
||||
import numpy as np
|
||||
import sounddevice as sd
|
||||
import torch
|
||||
from pynput import keyboard
|
||||
from transformers import AutoTokenizer, WhisperForConditionalGeneration, WhisperProcessor
|
||||
|
||||
from lerobot.policies.pi05.configuration_pi05 import PI05Config
|
||||
from lerobot.policies.pi05.modeling_pi05 import PI05Pytorch
|
||||
|
||||
SAMPLE_RATE = 16000
|
||||
|
||||
|
||||
def get_device():
|
||||
if torch.cuda.is_available():
|
||||
return torch.device("cuda")
|
||||
elif torch.backends.mps.is_available():
|
||||
return torch.device("mps")
|
||||
return torch.device("cpu")
|
||||
|
||||
|
||||
class Pi05VoiceAssistant:
|
||||
"""Voice assistant using Pi0.5 for generating robot utterances."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
pretrained_path: str | None = None,
|
||||
max_response_tokens: int = 100,
|
||||
max_record_seconds: float = 30.0,
|
||||
):
|
||||
self.device = get_device()
|
||||
self.dtype = torch.float32 if self.device.type == "mps" else torch.bfloat16
|
||||
self.max_response_tokens = max_response_tokens
|
||||
self.max_record_seconds = max_record_seconds
|
||||
|
||||
# Push-to-talk state
|
||||
self._recording = False
|
||||
self._audio_chunks: list[np.ndarray] = []
|
||||
self._stream: sd.InputStream | None = None
|
||||
|
||||
print(f"Using device: {self.device}")
|
||||
self._load_models(pretrained_path)
|
||||
|
||||
def _load_models(self, pretrained_path: str | None):
|
||||
print("Loading STT (Whisper tiny)...")
|
||||
self.stt_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
|
||||
self.stt_model = WhisperForConditionalGeneration.from_pretrained(
|
||||
"openai/whisper-tiny.en", torch_dtype=self.dtype
|
||||
).to(self.device)
|
||||
|
||||
print("Loading Pi0.5 model...")
|
||||
self._load_pi05(pretrained_path)
|
||||
|
||||
print("Loading tokenizer...")
|
||||
self.tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
|
||||
|
||||
self._load_tts()
|
||||
print("Ready!\n")
|
||||
|
||||
def _load_pi05(self, pretrained_path: str | None):
|
||||
"""Load Pi0.5 model for utterance generation."""
|
||||
config = PI05Config()
|
||||
config.dtype = "float32" if self.device.type == "mps" else "bfloat16"
|
||||
|
||||
self.pi05_model = PI05Pytorch(config)
|
||||
|
||||
if pretrained_path:
|
||||
try:
|
||||
from safetensors.torch import load_file
|
||||
state_dict = load_file(f"{pretrained_path}/model.safetensors")
|
||||
self.pi05_model.load_state_dict(state_dict, strict=False)
|
||||
print(f"✓ Loaded Pi0.5 weights from {pretrained_path}")
|
||||
except Exception as e:
|
||||
print(f"Warning: Could not load pretrained weights: {e}")
|
||||
print("Using randomly initialized model for demo purposes")
|
||||
|
||||
self.pi05_model = self.pi05_model.to(self.device)
|
||||
self.pi05_model.eval()
|
||||
|
||||
def _load_tts(self):
|
||||
try:
|
||||
print("Loading TTS (Kokoro 82M)...")
|
||||
from kokoro import KPipeline
|
||||
|
||||
self.tts_pipeline = KPipeline(lang_code="a") # American English
|
||||
self.tts_voice = "af_heart"
|
||||
self.tts_type = "kokoro"
|
||||
print("Kokoro loaded!")
|
||||
except Exception as e:
|
||||
print(f"Kokoro not available ({e})")
|
||||
print("Using macOS `say` for TTS")
|
||||
self.tts_pipeline = None
|
||||
self.tts_type = "system"
|
||||
|
||||
def _audio_callback(self, indata, frames, time_info, status):
|
||||
"""Callback for audio stream - collects chunks while recording."""
|
||||
if self._recording:
|
||||
self._audio_chunks.append(indata.copy())
|
||||
|
||||
def _start_recording(self):
|
||||
"""Start recording audio."""
|
||||
if self._recording:
|
||||
return
|
||||
self._recording = True
|
||||
self._audio_chunks = []
|
||||
print("🎤 Recording... (release SPACE to stop)")
|
||||
|
||||
def _stop_recording(self) -> np.ndarray | None:
|
||||
"""Stop recording and return the audio."""
|
||||
if not self._recording:
|
||||
return None
|
||||
self._recording = False
|
||||
|
||||
if not self._audio_chunks:
|
||||
return None
|
||||
|
||||
audio = np.concatenate(self._audio_chunks, axis=0).flatten()
|
||||
duration = len(audio) / SAMPLE_RATE
|
||||
volume = np.abs(audio).max()
|
||||
print(f"Recorded {duration:.1f}s, volume: {volume:.4f}")
|
||||
|
||||
if volume < 0.001:
|
||||
print("⚠️ Very low audio - check microphone permissions!")
|
||||
return None
|
||||
|
||||
return audio
|
||||
|
||||
def wait_for_spacebar(self) -> np.ndarray | None:
|
||||
"""Wait for spacebar press, record while held, return audio on release."""
|
||||
audio_result = None
|
||||
recording_done = threading.Event()
|
||||
|
||||
def on_press(key):
|
||||
if key == keyboard.Key.space:
|
||||
self._start_recording()
|
||||
|
||||
def on_release(key):
|
||||
nonlocal audio_result
|
||||
if key == keyboard.Key.space and self._recording:
|
||||
audio_result = self._stop_recording()
|
||||
recording_done.set()
|
||||
return False # Stop listener
|
||||
|
||||
# Start audio stream
|
||||
self._stream = sd.InputStream(
|
||||
samplerate=SAMPLE_RATE,
|
||||
channels=1,
|
||||
dtype="float32",
|
||||
callback=self._audio_callback,
|
||||
blocksize=int(SAMPLE_RATE * 0.1), # 100ms blocks
|
||||
)
|
||||
|
||||
with self._stream:
|
||||
print("\n⏳ Press and hold SPACE to speak...")
|
||||
with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
|
||||
# Wait for recording to complete or timeout
|
||||
recording_done.wait(timeout=self.max_record_seconds)
|
||||
if self._recording:
|
||||
audio_result = self._stop_recording()
|
||||
|
||||
return audio_result
|
||||
|
||||
def transcribe(self, audio: np.ndarray) -> str:
|
||||
start = time.perf_counter()
|
||||
inputs = self.stt_processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")
|
||||
input_features = inputs.input_features.to(self.device, dtype=self.dtype)
|
||||
tokens = self.stt_model.generate(input_features)
|
||||
text = self.stt_processor.batch_decode(tokens, skip_special_tokens=True)[0]
|
||||
print(f"STT: {time.perf_counter() - start:.2f}s")
|
||||
return text.strip()
|
||||
|
||||
def _create_dummy_images(self, batch_size: int = 1) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
|
||||
"""Create placeholder images for Pi0.5 when no camera is available."""
|
||||
image_shape = (batch_size, 3, 224, 224)
|
||||
dummy_image = torch.zeros(image_shape, dtype=torch.float32, device=self.device)
|
||||
dummy_mask = torch.ones(batch_size, dtype=torch.bool, device=self.device)
|
||||
return [dummy_image], [dummy_mask]
|
||||
|
||||
def _tokenize_prompt(self, text: str) -> tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Tokenize the user prompt for Pi0.5."""
|
||||
prompt = f"User request: {text}\nRobot response:"
|
||||
tokenized = self.tokenizer(
|
||||
[prompt],
|
||||
max_length=200,
|
||||
truncation=True,
|
||||
padding="max_length",
|
||||
return_tensors="pt",
|
||||
)
|
||||
tokens = tokenized["input_ids"].to(self.device)
|
||||
masks = tokenized["attention_mask"].to(self.device, dtype=torch.bool)
|
||||
return tokens, masks
|
||||
|
||||
def generate_response(self, user_text: str) -> str:
|
||||
"""Generate robot utterance using Pi0.5's language generation."""
|
||||
start = time.perf_counter()
|
||||
|
||||
images, img_masks = self._create_dummy_images()
|
||||
tokens, masks = self._tokenize_prompt(user_text)
|
||||
|
||||
with torch.no_grad():
|
||||
generated_tokens = self.pi05_model._generate_subtask_tokens(
|
||||
images=images,
|
||||
img_masks=img_masks,
|
||||
tokens=tokens,
|
||||
masks=masks,
|
||||
tokenizer=self.tokenizer,
|
||||
max_length=self.max_response_tokens,
|
||||
device=self.device,
|
||||
)
|
||||
|
||||
# Decode generated tokens
|
||||
valid_tokens = generated_tokens[0][generated_tokens[0] != 0]
|
||||
response = self.tokenizer.decode(valid_tokens, skip_special_tokens=True)
|
||||
|
||||
# Extract utterance if marked with special tokens
|
||||
response = self._extract_utterance(response)
|
||||
|
||||
print(f"Pi0.5: {time.perf_counter() - start:.2f}s")
|
||||
return response.strip()
|
||||
|
||||
def _extract_utterance(self, text: str) -> str:
|
||||
"""Extract utterance from between <utterance> tokens if present."""
|
||||
pattern = r"<utterance>(.*?)</utterance>"
|
||||
match = re.search(pattern, text, re.DOTALL)
|
||||
if match:
|
||||
return match.group(1).strip()
|
||||
return text
|
||||
|
||||
def speak(self, text: str):
|
||||
start = time.perf_counter()
|
||||
if self.tts_type == "kokoro":
|
||||
generator = self.tts_pipeline(text, voice=self.tts_voice)
|
||||
audio_chunks = [audio for _, _, audio in generator]
|
||||
if audio_chunks:
|
||||
audio = np.concatenate(audio_chunks)
|
||||
sd.play(audio, 24000)
|
||||
sd.wait()
|
||||
else:
|
||||
subprocess.run(["say", text], check=True)
|
||||
print(f"TTS: {time.perf_counter() - start:.2f}s")
|
||||
|
||||
def run(self):
|
||||
print("=" * 50)
|
||||
print("Pi0.5 Voice Assistant")
|
||||
print("=" * 50)
|
||||
print("• Hold SPACE to record your voice command")
|
||||
print("• Release SPACE when done speaking")
|
||||
print("• Press Ctrl+C to exit")
|
||||
print("=" * 50)
|
||||
|
||||
while True:
|
||||
try:
|
||||
audio = self.wait_for_spacebar()
|
||||
|
||||
if audio is None:
|
||||
print("(no audio captured)\n")
|
||||
continue
|
||||
|
||||
user_text = self.transcribe(audio)
|
||||
|
||||
if not user_text:
|
||||
print("(no speech detected)\n")
|
||||
continue
|
||||
|
||||
print(f"You: {user_text}")
|
||||
|
||||
response = self.generate_response(user_text)
|
||||
print(f"Robot: {response}\n")
|
||||
|
||||
self.speak(response)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\nGoodbye!")
|
||||
break
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Pi0.5 Voice Assistant")
|
||||
parser.add_argument(
|
||||
"--pretrained_path",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Path to pretrained Pi0.5 model (optional)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_response_tokens",
|
||||
type=int,
|
||||
default=100,
|
||||
help="Maximum tokens in generated response",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_record_seconds",
|
||||
type=float,
|
||||
default=30.0,
|
||||
help="Maximum recording duration in seconds",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
assistant = Pi05VoiceAssistant(
|
||||
pretrained_path=args.pretrained_path,
|
||||
max_response_tokens=args.max_response_tokens,
|
||||
max_record_seconds=args.max_record_seconds,
|
||||
)
|
||||
assistant.run()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,27 @@
|
||||
{
|
||||
"repo_id": "local",
|
||||
"vocab_size": 1024,
|
||||
"scale": 10.0,
|
||||
"encoded_dims": "0:7",
|
||||
"encoded_dim_ranges": [
|
||||
[
|
||||
0,
|
||||
7
|
||||
]
|
||||
],
|
||||
"total_encoded_dims": 7,
|
||||
"delta_dims": null,
|
||||
"delta_dim_list": null,
|
||||
"use_delta_transform": false,
|
||||
"state_key": "observation.state",
|
||||
"normalization_mode": "QUANTILES",
|
||||
"action_horizon": 10,
|
||||
"num_training_chunks": 25065,
|
||||
"compression_stats": {
|
||||
"compression_ratio": 3.464660463274599,
|
||||
"mean_token_length": 20.204,
|
||||
"p99_token_length": 36.00999999999999,
|
||||
"min_token_length": 5.0,
|
||||
"max_token_length": 38.0
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,158 @@
|
||||
import logging
|
||||
from typing import ClassVar
|
||||
|
||||
import numpy as np
|
||||
from scipy.fft import dct, idct
|
||||
from tokenizers import ByteLevelBPETokenizer
|
||||
from tokenizers.trainers import BpeTrainer
|
||||
from transformers import PreTrainedTokenizerFast
|
||||
from transformers.processing_utils import ProcessorMixin
|
||||
|
||||
|
||||
class UniversalActionProcessor(ProcessorMixin):
|
||||
attributes: ClassVar[list[str]] = ["bpe_tokenizer"]
|
||||
bpe_tokenizer_class: str = "AutoTokenizer"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
bpe_tokenizer: PreTrainedTokenizerFast,
|
||||
scale: float = 10,
|
||||
vocab_size: int = 1024,
|
||||
min_token: int = 0,
|
||||
*,
|
||||
action_dim: int | None = None,
|
||||
time_horizon: int | None = None,
|
||||
):
|
||||
self.scale = scale
|
||||
self.vocab_size = vocab_size
|
||||
self.min_token = min_token
|
||||
|
||||
# Action horizon and dimension needed during decoding. These can be specified
|
||||
# in three ways (in order of priority):
|
||||
# 1. passed in as kwargs to decode()
|
||||
# 2. in the constructor
|
||||
# 3. cached from the last time decode() was called
|
||||
self.time_horizon = time_horizon
|
||||
self.action_dim = action_dim
|
||||
self.called_time_horizon = time_horizon
|
||||
self.called_action_dim = action_dim
|
||||
|
||||
super().__init__(bpe_tokenizer)
|
||||
|
||||
def __call__(self, action_chunk: np.ndarray) -> list[list[int]]:
|
||||
assert action_chunk.ndim <= 3, "Only 3 dimensions supported: [batch, timesteps, action_dim]"
|
||||
if action_chunk.ndim == 2:
|
||||
action_chunk = action_chunk[None, ...]
|
||||
|
||||
# Cache the time horizon and action dimension for decoding
|
||||
self.called_time_horizon = action_chunk.shape[-2]
|
||||
self.called_action_dim = action_chunk.shape[-1]
|
||||
|
||||
dct_coeff = dct(action_chunk, axis=1, norm="ortho")
|
||||
dct_coeff = np.around(dct_coeff * self.scale)
|
||||
tokens = []
|
||||
for elem in dct_coeff:
|
||||
token_str = "".join(map(chr, np.maximum(elem.flatten() - self.min_token, 0).astype(int)))
|
||||
tokens.append(self.bpe_tokenizer(token_str)["input_ids"])
|
||||
return tokens
|
||||
|
||||
def decode(
|
||||
self,
|
||||
tokens: list[list[int]],
|
||||
*,
|
||||
time_horizon: int | None = None,
|
||||
action_dim: int | None = None,
|
||||
) -> np.ndarray:
|
||||
self.time_horizon = time_horizon or self.time_horizon or self.called_time_horizon
|
||||
self.action_dim = action_dim or self.action_dim or self.called_action_dim
|
||||
|
||||
# Cache the time horizon and action dimension for the next call
|
||||
self.called_time_horizon = self.time_horizon
|
||||
self.called_action_dim = self.action_dim
|
||||
|
||||
assert (
|
||||
self.time_horizon is not None and self.action_dim is not None
|
||||
), "Tokenizer not initialized: call the processor on an action chunk once, or pass in time_horizon and action_dim."
|
||||
|
||||
decoded_actions = []
|
||||
for token in tokens:
|
||||
try:
|
||||
decoded_tokens = self.bpe_tokenizer.decode(token)
|
||||
decoded_dct_coeff = np.array(list(map(ord, decoded_tokens))) + self.min_token
|
||||
decoded_dct_coeff = decoded_dct_coeff.reshape(-1, self.action_dim)
|
||||
assert (
|
||||
decoded_dct_coeff.shape
|
||||
== (
|
||||
self.time_horizon,
|
||||
self.action_dim,
|
||||
)
|
||||
), f"Decoded DCT coefficients have shape {decoded_dct_coeff.shape}, expected ({self.time_horizon}, {self.action_dim})"
|
||||
except Exception as e:
|
||||
print(f"Error decoding tokens: {e}")
|
||||
print(f"Tokens: {token}")
|
||||
decoded_dct_coeff = np.zeros((self.time_horizon, self.action_dim))
|
||||
decoded_actions.append(idct(decoded_dct_coeff / self.scale, axis=0, norm="ortho"))
|
||||
return np.stack(decoded_actions)
|
||||
|
||||
@classmethod
|
||||
def fit(
|
||||
cls,
|
||||
action_data: list[np.ndarray],
|
||||
scale: float = 10,
|
||||
vocab_size: int = 1024,
|
||||
*,
|
||||
time_horizon: int | None = None,
|
||||
action_dim: int | None = None,
|
||||
) -> "UniversalActionProcessor":
|
||||
# Run DCT over all inputs
|
||||
dct_tokens = [dct(a, axis=0, norm="ortho").flatten() for a in action_data]
|
||||
|
||||
# Quantize and find min token
|
||||
max_token = int(np.around(np.concatenate(dct_tokens) * scale).max())
|
||||
min_token = int(np.around(np.concatenate(dct_tokens) * scale).min())
|
||||
min_vocab_size = max_token - min_token
|
||||
|
||||
assert (
|
||||
min_vocab_size <= vocab_size
|
||||
), f"Vocab size {vocab_size} is too small for the range of tokens {min_vocab_size}"
|
||||
if min_vocab_size + 100 > vocab_size:
|
||||
logging.warning(
|
||||
f"Initial alphabet size {min_vocab_size} is almost as large as the vocab "
|
||||
f"size {vocab_size}, consider increasing vocab size"
|
||||
)
|
||||
|
||||
# Make token iterator for BPE training
|
||||
def _token_iter():
|
||||
for tokens in dct_tokens:
|
||||
rounded_tokens = np.around(tokens * scale) - min_token
|
||||
rounded_tokens = rounded_tokens.astype(int)
|
||||
string = "".join(map(chr, rounded_tokens))
|
||||
yield string
|
||||
|
||||
# Train BPE tokenizer
|
||||
bpe = ByteLevelBPETokenizer()
|
||||
|
||||
# Set up the entire range of possible tokens as the initial alphabet
|
||||
alphabet = [chr(i) for i in range(max_token - min_token + 1)]
|
||||
trainer = BpeTrainer(
|
||||
vocab_size=vocab_size,
|
||||
min_frequency=2,
|
||||
show_progress=True,
|
||||
special_tokens=[],
|
||||
initial_alphabet=alphabet,
|
||||
max_token_length=10000,
|
||||
)
|
||||
|
||||
# Train the inner tokenizer (don't use ByteLevelBPETokenizer.train_from_iterator()
|
||||
# because it doesn't support custom alphabets)
|
||||
bpe._tokenizer.train_from_iterator(_token_iter(), trainer=trainer)
|
||||
|
||||
return cls(
|
||||
PreTrainedTokenizerFast(tokenizer_object=bpe, clean_up_tokenization_spaces=False),
|
||||
scale=scale,
|
||||
vocab_size=vocab_size,
|
||||
min_token=min_token,
|
||||
time_horizon=time_horizon,
|
||||
action_dim=action_dim,
|
||||
)
|
||||
@@ -0,0 +1,11 @@
|
||||
{
|
||||
"action_dim": 7,
|
||||
"auto_map": {
|
||||
"AutoProcessor": "processing_action_tokenizer.UniversalActionProcessor"
|
||||
},
|
||||
"min_token": -32,
|
||||
"processor_class": "UniversalActionProcessor",
|
||||
"scale": 10.0,
|
||||
"time_horizon": 10,
|
||||
"vocab_size": 1024
|
||||
}
|
||||
@@ -0,0 +1 @@
|
||||
{}
|
||||
@@ -0,0 +1,11 @@
|
||||
{
|
||||
"added_tokens_decoder": {},
|
||||
"auto_map": {
|
||||
"AutoProcessor": "processing_action_tokenizer.UniversalActionProcessor"
|
||||
},
|
||||
"clean_up_tokenization_spaces": false,
|
||||
"extra_special_tokens": {},
|
||||
"model_max_length": 1000000000000000019884624838656,
|
||||
"processor_class": "UniversalActionProcessor",
|
||||
"tokenizer_class": "PreTrainedTokenizerFast"
|
||||
}
|
||||