mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-19 01:07:18 +00:00
Update lingbot_va.mdx
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
This commit is contained in:
+13
-23
@@ -11,12 +11,11 @@ interfaces.
|
||||
LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
|
||||
(`patch_embedding_mlp → blocks → proj_out`) and an action stream
|
||||
(`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
|
||||
text conditioning. Actions are produced by the dedicated `action_proj_out` head — they are
|
||||
**not** decoded from predicted pixels, though video and action are co-trained.
|
||||
text conditioning.
|
||||
|
||||
| Component | Class | Role |
|
||||
| ------------------------ | ----------------------- | -------------------------------------------------------------------------------------- |
|
||||
| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer (the only weights stored in the LeRobot checkpoint). |
|
||||
| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer. |
|
||||
| VAE (frozen) | `AutoencoderKLWan` | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo. |
|
||||
| Text encoder (frozen) | `UMT5EncoderModel` | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo. |
|
||||
|
||||
@@ -38,12 +37,10 @@ fed back into the KV cache as the chunk is executed (closed-loop world modeling)
|
||||
## Installation
|
||||
|
||||
1. Install LeRobot by following the [Installation Guide](./installation).
|
||||
2. Install the LingBot-VA extra (brings in `diffusers>=0.36` for the Wan2.2 stack):
|
||||
2. Install the LingBot-VA extra:
|
||||
|
||||
```bash
|
||||
pip install -e ".[lingbot_va]"
|
||||
# For LIBERO evaluation (Linux only):
|
||||
pip install -e ".[lingbot_va,libero]"
|
||||
```
|
||||
|
||||
## Checkpoints
|
||||
@@ -52,12 +49,12 @@ The released upstream checkpoints have been converted to LeRobot format and push
|
||||
|
||||
| Variant | LeRobot checkpoint |
|
||||
| ---------------------- | ---------------------------------- |
|
||||
| LIBERO-Long post-train | `pepijn223/lingbot_va_libero_long` |
|
||||
| RoboTwin post-train | `pepijn223/lingbot_va_robotwin` |
|
||||
| Pretrained base | `pepijn223/lingbot_va_base` |
|
||||
| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long` |
|
||||
| RoboTwin post-train | `lerobot/lingbot_va_robotwin` |
|
||||
| Pretrained base | `lerobot/lingbot_va_base` |
|
||||
|
||||
**Packaging:** only the trainable ~5B transformer is stored in the LeRobot
|
||||
`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are **lazily pulled** from
|
||||
Only the trainable ~5B transformer is stored in the LeRobot
|
||||
`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
|
||||
`config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
|
||||
UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
|
||||
transformer + VAE fit on a single 24–32 GB GPU.
|
||||
@@ -74,14 +71,12 @@ lerobot-eval \
|
||||
--output_dir=outputs/eval/lingbot_va_libero
|
||||
```
|
||||
|
||||
Native LeRobot eval reproduces **96% success on `libero_10` (LIBERO-Long)** (48/50 episodes).
|
||||
|
||||
LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
|
||||
single-environment eval; use `--eval.batch_size=1`.
|
||||
|
||||
## Evaluation (RoboTwin)
|
||||
|
||||
RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack — use the benchmark Docker image
|
||||
RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
|
||||
(`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
|
||||
with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
|
||||
control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
|
||||
@@ -132,7 +127,7 @@ The dataset must provide camera clips (a temporal window per camera, VAE-encoded
|
||||
|
||||
## Data format (action channels & camera order)
|
||||
|
||||
LingBot-VA is an **end-effector (Cartesian) pose** policy — it predicts EEF poses + gripper, not
|
||||
LingBot-VA is an **end-effector (Cartesian) pose** policy, it predicts EEF poses + gripper, not
|
||||
joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
|
||||
action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
|
||||
the channels a given checkpoint actually uses):
|
||||
@@ -154,7 +149,7 @@ the channels a given checkpoint actually uses):
|
||||
Joint-space datasets (or a different EEF convention) must be remapped into this schema before
|
||||
fine-tuning these checkpoints.
|
||||
|
||||
**Camera order is fixed and order-sensitive** — per-camera latents are concatenated spatially in
|
||||
**Camera order is fixed and order-sensitive**, per-camera latents are concatenated spatially in
|
||||
`obs_cam_keys` order, so the physical camera→slot mapping must match training:
|
||||
|
||||
| benchmark | `obs_cam_keys` (in order) | `camera_layout` |
|
||||
@@ -162,8 +157,7 @@ fine-tuning these checkpoints.
|
||||
| LIBERO | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist) | `width_concat` (latents concatenated on width) |
|
||||
| RoboTwin | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
|
||||
|
||||
The first camera is the exterior/head view and the rest are wrist views; swapping the order (or
|
||||
which physical camera maps to each slot) breaks inference.
|
||||
The first camera is the exterior/head view and the rest are wrist views.
|
||||
|
||||
## Inference Hyperparameters (LIBERO)
|
||||
|
||||
@@ -180,12 +174,8 @@ which physical camera maps to each slot) breaks inference.
|
||||
|
||||
These are the defaults of `LingBotVAConfig`; override any of them via `--policy.<name>=...`.
|
||||
|
||||
## Notes & Limitations
|
||||
## Notes
|
||||
|
||||
- **Correctness gate:** matching the upstream LIBERO success rate requires validating the
|
||||
converted checkpoint on a GPU and tensor-diffing intermediate activations against the
|
||||
upstream implementation. The most sensitive parts are the action quantile normalization,
|
||||
the camera ordering, the `action_per_frame`/`frame_chunk_size` alignment, and `attn_mode`.
|
||||
- **Attention backend:** inference uses the `torch` SDPA backend (always available). The
|
||||
`flashattn` and `flex` backends are optional; `flex` is only needed for training.
|
||||
- **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs
|
||||
|
||||
Reference in New Issue
Block a user