From 5568ce7af1a2e83f001869f3f4f0b24dcd98a662 Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Mon, 8 Jun 2026 10:47:34 +0200
Subject: [PATCH] Update lingbot_va.mdx

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
---
 docs/source/lingbot_va.mdx | 36 +++++++++++++-----------------------
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/docs/source/lingbot_va.mdx b/docs/source/lingbot_va.mdx
index a4405b96c..54ad23ef7 100644
--- a/docs/source/lingbot_va.mdx
+++ b/docs/source/lingbot_va.mdx
@@ -11,12 +11,11 @@ interfaces.
 LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
 (`patch_embedding_mlp → blocks → proj_out`) and an action stream
 (`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
-text conditioning. Actions are produced by the dedicated `action_proj_out` head — they are
-**not** decoded from predicted pixels, though video and action are co-trained.
+text conditioning.
 
 | Component                | Class                   | Role                                                                                   |
 | ------------------------ | ----------------------- | -------------------------------------------------------------------------------------- |
-| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer (the only weights stored in the LeRobot checkpoint). |
+| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer.                                                     |
 | VAE (frozen)             | `AutoencoderKLWan`      | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo.                              |
 | Text encoder (frozen)    | `UMT5EncoderModel`      | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo.                            |
 
@@ -38,12 +37,10 @@ fed back into the KV cache as the chunk is executed (closed-loop world modeling)
 ## Installation
 
 1. Install LeRobot by following the [Installation Guide](./installation).
-2. Install the LingBot-VA extra (brings in `diffusers>=0.36` for the Wan2.2 stack):
+2. Install the LingBot-VA extra:
 
 ```bash
 pip install -e ".[lingbot_va]"
-# For LIBERO evaluation (Linux only):
-pip install -e ".[lingbot_va,libero]"
 ```
 
 ## Checkpoints
@@ -52,12 +49,12 @@ The released upstream checkpoints have been converted to LeRobot format and push
 
 | Variant                | LeRobot checkpoint                 |
 | ---------------------- | ---------------------------------- |
-| LIBERO-Long post-train | `pepijn223/lingbot_va_libero_long` |
-| RoboTwin post-train    | `pepijn223/lingbot_va_robotwin`    |
-| Pretrained base        | `pepijn223/lingbot_va_base`        |
+| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long`   |
+| RoboTwin post-train    | `lerobot/lingbot_va_robotwin`      |
+| Pretrained base        | `lerobot/lingbot_va_base`          |
 
-**Packaging:** only the trainable ~5B transformer is stored in the LeRobot
-`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are **lazily pulled** from
+Only the trainable ~5B transformer is stored in the LeRobot
+`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
 `config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
 UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
 transformer + VAE fit on a single 24–32 GB GPU.
@@ -74,14 +71,12 @@ lerobot-eval \
     --output_dir=outputs/eval/lingbot_va_libero
 ```
 
-Native LeRobot eval reproduces **96% success on `libero_10` (LIBERO-Long)** (48/50 episodes).
-
 LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
 single-environment eval; use `--eval.batch_size=1`.
 
 ## Evaluation (RoboTwin)
 
-RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack — use the benchmark Docker image
+RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
 (`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
 with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
 control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
@@ -132,7 +127,7 @@ The dataset must provide camera clips (a temporal window per camera, VAE-encoded
 
 ## Data format (action channels & camera order)
 
-LingBot-VA is an **end-effector (Cartesian) pose** policy — it predicts EEF poses + gripper, not
+LingBot-VA is an **end-effector (Cartesian) pose** policy: it predicts EEF poses + gripper, not
 joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
 action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
 the channels a given checkpoint actually uses):
@@ -154,7 +149,7 @@ the channels a given checkpoint actually uses):
 Joint-space datasets (or a different EEF convention) must be remapped into this schema before
 fine-tuning these checkpoints.
 
-**Camera order is fixed and order-sensitive** — per-camera latents are concatenated spatially in
+**Camera order is fixed and order-sensitive**: per-camera latents are concatenated spatially in
 `obs_cam_keys` order, so the physical camera→slot mapping must match training:
 
 | benchmark | `obs_cam_keys` (in order)                                                                             | `camera_layout`                                                     |
@@ -162,8 +157,7 @@ fine-tuning these checkpoints.
 | LIBERO    | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist)  | `width_concat` (latents concatenated on width)                      |
 | RoboTwin  | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
 
-The first camera is the exterior/head view and the rest are wrist views; swapping the order (or
-which physical camera maps to each slot) breaks inference.
+The first camera is the exterior/head view and the rest are wrist views.
 
 ## Inference Hyperparameters (LIBERO)
 
@@ -180,12 +174,8 @@ which physical camera maps to each slot) breaks inference.
 
 These are the defaults of `LingBotVAConfig`; override any of them via `--policy.<name>=...`.
 
-## Notes & Limitations
+## Notes
 
-- **Correctness gate:** matching the upstream LIBERO success rate requires validating the
-  converted checkpoint on a GPU and tensor-diffing intermediate activations against the
-  upstream implementation. The most sensitive parts are the action quantile normalization,
-  the camera ordering, the `action_per_frame`/`frame_chunk_size` alignment, and `attn_mode`.
 - **Attention backend:** inference uses the `torch` SDPA backend (always available). The
   `flashattn` and `flex` backends are optional; `flex` is only needed for training.
 - **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs