diff --git a/docs/source/lingbot_va.mdx b/docs/source/lingbot_va.mdx
index 76c6d3fb7..a4405b96c 100644
--- a/docs/source/lingbot_va.mdx
+++ b/docs/source/lingbot_va.mdx
@@ -130,6 +130,41 @@ lerobot-train \
 The dataset must provide camera clips (a temporal window per camera, VAE-encoded to
 `frame_chunk_size` latent frames) and `frame_chunk_size * action_per_frame` action steps per item.
 
+## Data format (action channels & camera order)
+
+LingBot-VA is an **end-effector (Cartesian) pose** policy: it predicts EEF poses plus gripper
+commands, not joint positions. Actions use a fixed, multi-embodiment **30-dim** layout. Map your
+robot's action dimensions into the channels below and pad the rest with `0`;
+`used_action_channel_ids` selects the channels a given checkpoint actually uses:
+
+| channels | meaning                                               |
+| -------- | ----------------------------------------------------- |
+| 0–6      | Left-arm end-effector pose                            |
+| 7–13     | Right-arm end-effector pose                           |
+| 14–20    | Left-arm joints (unused by the released checkpoints)  |
+| 21–27    | Right-arm joints (unused by the released checkpoints) |
+| 28       | Left gripper                                          |
+| 29       | Right gripper                                         |
+
+- **LIBERO** (single arm) uses channels `0–6`, 7 dims in total: a 6-DoF EEF delta (xyz +
+  rotation) plus the gripper, which is packed into this range rather than channel `28`.
+- **RoboTwin** uses channels `[0–6, 28, 7–13, 29]`, 16 dims in total: left EEF (xyz + quaternion),
+  left gripper, right EEF, right gripper. The env converts these poses to joint trajectories via
+  CuRobo IK; joints are never predicted.
+
+Joint-space datasets (or a different EEF convention) must be remapped into this schema before
+fine-tuning these checkpoints.
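+
+A minimal sketch of that remapping for a RoboTwin-style dual-arm action, assuming the 16 source
+dims arrive in the order listed above. `ACTION_DIM`, `ROBOTWIN_CHANNEL_IDS`, and `pack_action` are
+illustrative names, not part of the LeRobot or LingBot-VA API:
+
+```python
+ACTION_DIM = 30  # size of the fixed multi-embodiment layout described above
+
+# Hypothetical constant: target channel for each of the 16 RoboTwin dims, in order
+# [left EEF (0-6), left gripper (28), right EEF (7-13), right gripper (29)].
+ROBOTWIN_CHANNEL_IDS = [*range(0, 7), 28, *range(7, 14), 29]
+
+
+def pack_action(action, channel_ids, action_dim=ACTION_DIM):
+    """Scatter a robot-specific action into the padded layout; unused channels stay 0."""
+    if len(action) != len(channel_ids):
+        raise ValueError(f"expected {len(channel_ids)} dims, got {len(action)}")
+    padded = [0.0] * action_dim
+    for value, channel in zip(action, channel_ids):
+        padded[channel] = float(value)
+    return padded
+
+
+robot_action = [float(i + 1) for i in range(16)]  # dummy 16-dim action for illustration
+padded = pack_action(robot_action, ROBOTWIN_CHANNEL_IDS)
+```
+
+Under the same assumptions, a 7-dim LIBERO action would use `channel_ids=list(range(7))`.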
+
+**Camera order is fixed.** Per-camera latents are concatenated spatially in `obs_cam_keys` order,
+so the mapping from physical camera to slot must match the one used in training:
+
+| benchmark | `obs_cam_keys` (in order)                                                                             | `camera_layout`                                                     |
+| --------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
+| LIBERO    | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist)  | `width_concat` (latents concatenated on width)                      |
+| RoboTwin  | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
+
+In both layouts the first camera is the exterior/head view and the remaining cameras are wrist
+views. Swapping the order, or feeding a different physical camera into a slot, puts views in
+positions that do not match training and breaks inference.
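+
+A small guard along these lines can catch a misordered config before inference. It is only a
+sketch: `EXPECTED_CAM_KEYS` and `check_camera_order` are hypothetical names, and the key lists are
+copied from the table above:
+
+```python
+# Expected order copied from the table above; the dict and helper are illustrative only.
+EXPECTED_CAM_KEYS = {
+    "libero": ["observation.images.image", "observation.images.image2"],
+    "robotwin": [
+        "observation.images.head_camera",
+        "observation.images.left_camera",
+        "observation.images.right_camera",
+    ],
+}
+
+
+def check_camera_order(benchmark, obs_cam_keys):
+    """Raise ValueError if the configured camera keys differ from the expected order."""
+    expected = EXPECTED_CAM_KEYS[benchmark]
+    if list(obs_cam_keys) != expected:
+        raise ValueError(
+            f"{benchmark}: expected obs_cam_keys {expected}, got {list(obs_cam_keys)}"
+        )
+
+
+check_camera_order("libero", EXPECTED_CAM_KEYS["libero"])  # matching order: no exception
+```
+
+The check only compares key names. Whether each key is fed by the right physical camera still
+has to be verified on the robot or in the dataset.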
+
 ## Inference Hyperparameters (LIBERO)
 
 | Key                                    | Value                                                                             |