Update lingbot_va.mdx

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Pepijn
2026-06-08 10:47:34 +02:00
committed by GitHub
parent be0320a420
commit 5568ce7af1
+13 -23
@@ -11,12 +11,11 @@ interfaces.
LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
(`patch_embedding_mlp → blocks → proj_out`) and an action stream
(`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
text conditioning. Actions are produced by the dedicated `action_proj_out` head — they are
**not** decoded from predicted pixels, though video and action are co-trained.
text conditioning.
| Component | Class | Role |
| ------------------------ | ----------------------- | -------------------------------------------------------------------------------------- |
| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer (the only weights stored in the LeRobot checkpoint). |
| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer. |
| VAE (frozen) | `AutoencoderKLWan` | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo. |
| Text encoder (frozen) | `UMT5EncoderModel` | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo. |
@@ -38,12 +37,10 @@ fed back into the KV cache as the chunk is executed (closed-loop world modeling)
## Installation
1. Install LeRobot by following the [Installation Guide](./installation).
2. Install the LingBot-VA extra (brings in `diffusers>=0.36` for the Wan2.2 stack):
2. Install the LingBot-VA extra:
```bash
pip install -e ".[lingbot_va]"
# For LIBERO evaluation (Linux only):
pip install -e ".[lingbot_va,libero]"
```
## Checkpoints
@@ -52,12 +49,12 @@ The released upstream checkpoints have been converted to LeRobot format and push
| Variant | LeRobot checkpoint |
| ---------------------- | ---------------------------------- |
| LIBERO-Long post-train | `pepijn223/lingbot_va_libero_long` |
| RoboTwin post-train | `pepijn223/lingbot_va_robotwin` |
| Pretrained base | `pepijn223/lingbot_va_base` |
| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long` |
| RoboTwin post-train | `lerobot/lingbot_va_robotwin` |
| Pretrained base | `lerobot/lingbot_va_base` |
**Packaging:** only the trainable ~5B transformer is stored in the LeRobot
`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are **lazily pulled** from
Only the trainable ~5B transformer is stored in the LeRobot
`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
`config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
transformer + VAE fit on a single 24–32 GB GPU.
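As a rough sanity check on the GPU budget, the weight footprint of the ~5B transformer can be estimated from the parameter count alone. This is a back-of-the-envelope sketch: the bf16/fp32 storage widths are assumptions (the text above does not state the dtype), `weight_gigabytes` is a hypothetical helper, and activations, the KV cache, and the VAE are not counted.

```python
def weight_gigabytes(num_params: float, bytes_per_param: int) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# ~5B parameters, as stated for the DiT backbone.
NUM_PARAMS = 5e9

# Assumed storage widths: 2 bytes for bf16/fp16, 4 bytes for fp32.
transformer_bf16_gb = weight_gigabytes(NUM_PARAMS, 2)
transformer_fp32_gb = weight_gigabytes(NUM_PARAMS, 4)

print(f"bf16 weights: {transformer_bf16_gb:.1f} GB")
print(f"fp32 weights: {transformer_fp32_gb:.1f} GB")
```

Under the half-precision assumption the weights alone take about 10 GB, which is consistent with keeping the text encoder off the GPU to leave room for the VAE and inference state.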
@@ -74,14 +71,12 @@ lerobot-eval \
--output_dir=outputs/eval/lingbot_va_libero
```
Native LeRobot eval reproduces **96% success on `libero_10` (LIBERO-Long)** (48/50 episodes).
LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
single-environment eval; use `--eval.batch_size=1`.
## Evaluation (RoboTwin)
RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack - use the benchmark Docker image
RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
(`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
@@ -132,7 +127,7 @@ The dataset must provide camera clips (a temporal window per camera, VAE-encoded
## Data format (action channels & camera order)
LingBot-VA is an **end-effector (Cartesian) pose** policy - it predicts EEF poses + gripper, not
LingBot-VA is an **end-effector (Cartesian) pose** policy: it predicts EEF poses + gripper, not
joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
the channels a given checkpoint actually uses):
@@ -154,7 +149,7 @@ the channels a given checkpoint actually uses):
Joint-space datasets (or a different EEF convention) must be remapped into this schema before
fine-tuning these checkpoints.
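The mapping-and-padding step described above can be sketched in a few lines. This is a minimal illustration, not the LeRobot implementation: `pad_to_layout` is a hypothetical helper, and the channel ids in the example are placeholders rather than the real LingBot-VA channel assignment, which must be taken from the layout table and the checkpoint's `used_action_channel_ids`.

```python
ACTION_DIM = 30  # width of the fixed multi-embodiment layout


def pad_to_layout(robot_action, channel_ids, action_dim=ACTION_DIM):
    """Scatter robot action values into a zero-padded fixed-width vector.

    robot_action: sequence of floats, one per robot action dimension.
    channel_ids:  target slot in the fixed layout for each of those values.
    """
    if len(robot_action) != len(channel_ids):
        raise ValueError("need exactly one channel id per action dimension")
    if len(set(channel_ids)) != len(channel_ids):
        raise ValueError("channel ids must be unique")
    if any(not 0 <= channel < action_dim for channel in channel_ids):
        raise ValueError("channel id outside the fixed layout")

    padded = [0.0] * action_dim  # unused channels stay at 0
    for value, channel in zip(robot_action, channel_ids):
        padded[channel] = float(value)
    return padded


# Hypothetical single-arm action: xyz + quaternion + gripper = 8 values.
# PLACEHOLDER ids, not the real mapping.
example_ids = [0, 1, 2, 3, 4, 5, 6, 7]
example_action = [0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0, 0.5]
padded_example = pad_to_layout(example_action, example_ids)
```

Validating the ids up front is a deliberate choice: a duplicated or out-of-range channel would otherwise silently overwrite or drop an action dimension.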
**Camera order is fixed and order-sensitive** - per-camera latents are concatenated spatially in
**Camera order is fixed and order-sensitive**: per-camera latents are concatenated spatially in
`obs_cam_keys` order, so the physical camera→slot mapping must match training:
| benchmark | `obs_cam_keys` (in order) | `camera_layout` |
@@ -162,8 +157,7 @@ fine-tuning these checkpoints.
| LIBERO | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist) | `width_concat` (latents concatenated on width) |
| RoboTwin | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
The first camera is the exterior/head view and the rest are wrist views; swapping the order (or
which physical camera maps to each slot) breaks inference.
The first camera is the exterior/head view and the rest are wrist views.
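To see why the order matters for the `width_concat` layout, here is a toy sketch that concatenates per-camera grids along the width axis in key order. It is a simplification under stated assumptions: real latents are multi-channel tensors, `width_concat` here is a hypothetical stand-in for the policy's internal layout code, and the `robotwin_tshape` layout is not modeled.

```python
def width_concat(latents_by_key, cam_keys):
    """Concatenate 2-D per-camera grids (lists of rows) along width, in cam_keys order."""
    grids = [latents_by_key[key] for key in cam_keys]
    heights = {len(grid) for grid in grids}
    if len(heights) != 1:
        raise ValueError("all camera grids must share the same height")

    rows = []
    for row_index in range(heights.pop()):
        row = []
        for grid in grids:
            row.extend(grid[row_index])  # append this camera's columns
        rows.append(row)
    return rows


# LIBERO camera keys from the table above; the grid values are toy placeholders.
latents = {
    "observation.images.image": [["A", "A"], ["A", "A"]],    # agentview
    "observation.images.image2": [["B", "B"], ["B", "B"]],   # wrist
}
trained_order = ["observation.images.image", "observation.images.image2"]

correct = width_concat(latents, trained_order)
swapped = width_concat(latents, list(reversed(trained_order)))
```

The two results contain the same values in different columns, so a model trained on one arrangement sees a spatially different input when the cameras are swapped.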
## Inference Hyperparameters (LIBERO)
@@ -180,12 +174,8 @@ which physical camera maps to each slot) breaks inference.
These are the defaults of `LingBotVAConfig`; override any of them via `--policy.<name>=...`.
## Notes & Limitations
## Notes
- **Correctness gate:** matching the upstream LIBERO success rate requires validating the
converted checkpoint on a GPU and tensor-diffing intermediate activations against the
upstream implementation. The most sensitive parts are the action quantile normalization,
the camera ordering, the `action_per_frame`/`frame_chunk_size` alignment, and `attn_mode`.
- **Attention backend:** inference uses the `torch` SDPA backend (always available). The
`flashattn` and `flex` backends are optional; `flex` is only needed for training.
- **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs