From 5568ce7af1a2e83f001869f3f4f0b24dcd98a662 Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Mon, 8 Jun 2026 10:47:34 +0200
Subject: [PATCH] Update lingbot_va.mdx

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
---
 docs/source/lingbot_va.mdx | 36 +++++++++++++-----------------------
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/docs/source/lingbot_va.mdx b/docs/source/lingbot_va.mdx
index a4405b96c..54ad23ef7 100644
--- a/docs/source/lingbot_va.mdx
+++ b/docs/source/lingbot_va.mdx
@@ -11,12 +11,11 @@ interfaces.
 LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
 (`patch_embedding_mlp → blocks → proj_out`) and an action stream
 (`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
-text conditioning. Actions are produced by the dedicated `action_proj_out` head — they are
-**not** decoded from predicted pixels, though video and action are co-trained.
+text conditioning.
 
 | Component | Class | Role |
 | ------------------------ | ----------------------- | -------------------------------------------------------------------------------------- |
-| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer (the only weights stored in the LeRobot checkpoint). |
+| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer. |
 | VAE (frozen) | `AutoencoderKLWan` | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo. |
 | Text encoder (frozen) | `UMT5EncoderModel` | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo. |
 
@@ -38,12 +37,10 @@ fed back into the KV cache as the chunk is executed (closed-loop world modeling)
 ## Installation
 
 1. Install LeRobot by following the [Installation Guide](./installation).
-2. Install the LingBot-VA extra (brings in `diffusers>=0.36` for the Wan2.2 stack):
+2. Install the LingBot-VA extra:
 
 ```bash
 pip install -e ".[lingbot_va]"
-# For LIBERO evaluation (Linux only):
-pip install -e ".[lingbot_va,libero]"
 ```
 
 ## Checkpoints
@@ -52,12 +49,12 @@ The released upstream checkpoints have been converted to LeRobot format and push
 
 | Variant | LeRobot checkpoint |
 | ---------------------- | ---------------------------------- |
-| LIBERO-Long post-train | `pepijn223/lingbot_va_libero_long` |
-| RoboTwin post-train | `pepijn223/lingbot_va_robotwin` |
-| Pretrained base | `pepijn223/lingbot_va_base` |
+| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long` |
+| RoboTwin post-train | `lerobot/lingbot_va_robotwin` |
+| Pretrained base | `lerobot/lingbot_va_base` |
 
-**Packaging:** only the trainable ~5B transformer is stored in the LeRobot
-`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are **lazily pulled** from
+Only the trainable ~5B transformer is stored in the LeRobot
+`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
 `config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
 UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
 transformer + VAE fit on a single 24–32 GB GPU.
@@ -74,14 +71,12 @@ lerobot-eval \
   --output_dir=outputs/eval/lingbot_va_libero
 ```
 
-Native LeRobot eval reproduces **96% success on `libero_10` (LIBERO-Long)** (48/50 episodes).
-
 LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
 single-environment eval; use `--eval.batch_size=1`.
 
 ## Evaluation (RoboTwin)
 
-RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack — use the benchmark Docker image
+RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
 (`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
 with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
 control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
@@ -132,7 +127,7 @@ The dataset must provide camera clips (a temporal window per camera, VAE-encoded
 
 ## Data format (action channels & camera order)
 
-LingBot-VA is an **end-effector (Cartesian) pose** policy — it predicts EEF poses + gripper, not
+LingBot-VA is an **end-effector (Cartesian) pose** policy, it predicts EEF poses + gripper, not
 joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
 action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
 the channels a given checkpoint actually uses):
@@ -154,7 +149,7 @@ the channels a given checkpoint actually uses):
 Joint-space datasets (or a different EEF convention) must be remapped into this schema before
 fine-tuning these checkpoints.
 
-**Camera order is fixed and order-sensitive** — per-camera latents are concatenated spatially in
+**Camera order is fixed and order-sensitive**, per-camera latents are concatenated spatially in
 `obs_cam_keys` order, so the physical camera→slot mapping must match training:
 
 | benchmark | `obs_cam_keys` (in order) | `camera_layout` |
@@ -162,8 +157,7 @@ fine-tuning these checkpoints.
 | LIBERO | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist) | `width_concat` (latents concatenated on width) |
 | RoboTwin | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
 
-The first camera is the exterior/head view and the rest are wrist views; swapping the order (or
-which physical camera maps to each slot) breaks inference.
+The first camera is the exterior/head view and the rest are wrist views.
 
 ## Inference Hyperparameters (LIBERO)
 
@@ -180,12 +174,8 @@ which physical camera maps to each slot) breaks inference.
 
 These are the defaults of `LingBotVAConfig`; override any of them via `--policy.=...`.
 
-## Notes & Limitations
+## Notes
 
-- **Correctness gate:** matching the upstream LIBERO success rate requires validating the
-  converted checkpoint on a GPU and tensor-diffing intermediate activations against the
-  upstream implementation. The most sensitive parts are the action quantile normalization,
-  the camera ordering, the `action_per_frame`/`frame_chunk_size` alignment, and `attn_mode`.
 - **Attention backend:** inference uses the `torch` SDPA backend (always available). The
   `flashattn` and `flex` backends are optional; `flex` is only needed for training.
 - **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs