mirror of
https://github.com/huggingface/lerobot.git
synced 2026-07-04 08:37:10 +00:00
e275ea3960
* feat(policies): add LingBot-VA autoregressive video-action world model

  Port the LingBot-VA policy (Wan2.2 dual-stream video+action world model) into LeRobot, following the EO-1 / VLA-JEPA conventions. Covers inference, checkpoint conversion, and predicted-video saving (training is deferred to a follow-up PR).

  - Vendored Wan transformer/attention/flex/VAE/scheduler modules (key names preserved for near-identity conversion); torch SDPA default, flashattn/flex lazy-guarded.
  - LingBotVAConfig (registered "lingbot_va") + processor with fixed-quantile action unnormalization; full dual-stream sampling loop with CFG, two flow-matching schedulers and KV cache, mapped onto select_action with observed-keyframe feedback.
  - convert_lingbot_va_checkpoints.py (libero/robotwin variants): bundles the ~5B transformer, lazy-pulls the frozen VAE+UMT5 from the source repo.
  - Predicted-video plumbing in lerobot_eval (predicted_frames_callback; opt-in via --policy.save_predicted_video) and ConstantWithWarmupSchedulerConfig.
  - pyproject: widen diffusers-dep to <0.37, add lingbot_va + imageio-dep extras, add lingbot_va and (missing) eo1 to `all`.
  - Factory + policies/__init__ wiring, docs page + toctree, and tests.

  Note: the LIBERO success-rate correctness gate must be validated on a CUDA GPU with the converted checkpoint.

* feat(lingbot_va): RoboTwin eef-pose eval, single-file model, Hub checkpoints

  Make the LingBot-VA port runnable on both LIBERO and RoboTwin and clean up the package to LeRobot conventions.

  - Consolidate all vendored Wan2.2 model code (transformer, attention, VAE helpers, flow-matching scheduler, grid utils, flex-attention) into a single modeling_lingbot_va.py; remove the separate wan_*/schedulers modules.
  - Move the fixed action (un)normalization quantiles out of the config and into the post-processor (LIBERO 7-DoF + RoboTwin 16-d eef); remove the conversion script in favour of ready-to-use LeRobot-format checkpoints on the Hub.
  - Fixes found via on-sim validation: undo LIBERO's 180-degree image flip (image_hflip), encode obs as a multi-frame streaming-VAE clip, reset the streaming VAE cache between episodes, run the transformer in config.dtype, lazy-load frozen VAE/UMT5 by subfolder with the text encoder on CPU.
  - RoboTwin: add an end-effector-pose action mode to RoboTwinEnv (16-d per-arm xyz+quat+gripper deltas composed onto the initial eef pose, executed via CuRobo IK) and the robotwin_tshape latent layout (full-res head + half-res wrists via a second streaming VAE) with the upstream RoboTwin action quantiles + camera mapping.
  - Predicted-video saving works for both benchmarks; docs + tests updated.

* feat(lingbot_va): implement training / fine-tuning (flow-matching loss)

  - Implement LingBotVAPolicy.forward(): dual-stream flow-matching training loss (latent + action, timestep-weighted, action-masked) ported from upstream train.py; VAE-encodes camera clips, UMT5-encodes the task, noises both streams, runs the block-causal flex-attention training pass (forward_train).
  - training_loss_from_streams() core + _build_training_streams() data prep (action scatter into the 30-d space, multi-frame VAE encode incl. robotwin_tshape).
  - get_optim_params returns only trainable transformer params (LoRA/PEFT friendly); VAE/UMT5 stay frozen. Training needs attn_mode='flex'.
  - Add a tiny-config single-training-step test (forward->loss->backward->AdamW) and a Training/fine-tuning section in the docs.

* fix(lingbot_va): CI quality gate + fast-test collection

  - Add tests/policies/lingbot_va/__init__.py so the test files don't clash by basename with tests/policies/vla_jepa/* under pytest's default import mode (fast-test collection error).
  - Fix vendored typos flagged by the typos hook (pach_scale->patch_scale, total_tolen->total_token_len, stablized->stabilized) and a mypy union-attr in RoboTwinEnv._read_eef_pose.
  - Apply Prettier formatting to docs/source/lingbot_va.mdx.

* docs(lingbot_va): document EEF action-channel schema + camera order

* Update lingbot_va.mdx

  Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update pyproject.toml

  Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update pyproject.toml

  Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* refactor(lingbot_va): drop hardcoded action quantiles; source from checkpoint

  The LIBERO/RoboTwin action (un)normalization quantiles were hardcoded as module constants in processor_lingbot_va.py. They are already serialized into each checkpoint's policy_postprocessor.json (via LingBotVAActionUnnormalizeStep.get_config) and restored on load by PolicyProcessorPipeline.from_pretrained, so the constants are dead at eval/load time for the released checkpoints (verified: libero_long/robotwin/base all carry their quantiles on the Hub).

  - Remove LIBERO_ACTION_Q01/Q99, ROBOTWIN_ACTION_Q01/Q99 and _default_action_quantiles.
  - make_lingbot_va_pre_post_processors now defaults a fresh (unconverted) build to a neutral [-1, 1] mapping (identity rescale); real per-benchmark stats come from the saved checkpoint (or postprocessor_overrides), analogous to dataset-stats normalization.
  - Update the config doc comment to point at the checkpoint as the source of truth.
  - Tests: replace the LIBERO-default assertion with a neutral-default check, and add a save_pretrained/from_pretrained round-trip guard for the quantile serialization.

* docs(lingbot_va): trim verbose comments

  - configuration_lingbot_va.py: condense multi-line field comments to one-liners (keep the ── section headers).
  - processor_lingbot_va.py: shorten the action-quantile explanation block.
  - modeling_lingbot_va.py: drop the bare "# ----" separator rules, keeping the one-line section headers.

  No code changes.

* docs(lingbot_va): trim provenance comments; default wan path to base repo

  - configuration_lingbot_va.py: drop the "──" decorations and the "(from transformer/config.json)" note; default wan_pretrained_path to robbyant/lingbot-va-base (has the frozen vae/text_encoder/tokenizer subfolders).
  - modeling_lingbot_va.py: remove the vendored-code banner and the "(upstream wan_va/...)" section-header provenance/dash decorations; condense the transformer-dtype comment to one line.

  No code changes.

* refactor(lingbot_va): use built-in UnnormalizerProcessorStep for actions

  Replace the bespoke LingBotVAActionUnnormalizeStep with the standard UnnormalizerProcessorStep in QUANTILES mode, which computes the identical (action + 1) / 2 * (q99 - q01) + q01 mapping. The per-channel q01/q99 are stored as the step's saved state (a safetensors file) and restored on load; a fresh build has no action stats so the step is an identity passthrough.

  The 3 Hub checkpoints (lerobot/lingbot_va_{libero_long,robotwin,base}) have been re-uploaded with the new post-processor (policy_postprocessor.json + *_unnormalizer_processor.safetensors); reloading from the Hub round-trips q01/q99.

  - processor_lingbot_va.py: drop the custom step + registry; build the post-processor with UnnormalizerProcessorStep (explicit ACTION->QUANTILES norm_map so the preprocessor / training path is unchanged).
  - tests: assert the built-in step is used, identity-when-no-stats, correct quantile unnormalization, and a save_pretrained/from_pretrained stats round-trip.

* docs(lingbot_va): point checkpoint paths at the lerobot org

  The LeRobot-format checkpoints moved from pepijn223/* to lerobot/* (libero_long, robotwin, base). Update the eval/train --policy.path examples accordingly.

* docs(lingbot_va): condense processor normalization comments

* fix(lingbot-va): align RoboTwin evaluation (#3784)

  Thank you for the RoboTwin fix, and alignment!

* applying fixes

* updating uv lock and linting

* adjusting test to match expected values

* cleaning up deps

* cleaning up top level imports, styling, and deps guards

* cleanup

* moving wan utils and loading utils to `utils.py`

* removing ftfy by replicating the prompt_clean function without it (we don't expect to have weird chars given in the prompt anyway)

* removing unused function

* guarding for scipy dep, renaming test to avoid collision

* adding back accelerate for peak memory usage optim + justifying robotwin description dep

---------

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: pepijn223 <pepijn223@hf.co>
Co-authored-by: Gangwei XU <gwxu@hust.edu.cn>
Co-authored-by: Maxime Ellerbach <maxime.ellerbach@huggingface.co>
188 lines
10 KiB
Plaintext
# LingBot-VA

LingBot-VA is an **autoregressive video-action world-model policy** built on the **Wan2.2**
video-diffusion stack. It interleaves, in one autoregressive sequence, the prediction of
future **video latents** and **robot actions** ("VA" = Video-Action). The LeRobot
integration wires LingBot-VA into the standard training, evaluation and processor
interfaces.

## Model Overview

LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
(`patch_embedding_mlp → blocks → proj_out`) and an action stream
(`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
text conditioning.

| Component                | Class                   | Role                                                        |
| ------------------------ | ----------------------- | ----------------------------------------------------------- |
| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer.                          |
| VAE (frozen)             | `AutoencoderKLWan`      | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo.   |
| Text encoder (frozen)    | `UMT5EncoderModel`      | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo. |

At inference the policy runs an autoregressive loop per chunk: it denoises the video-latent
stream (CFG, ~20 steps) and the action stream (~50 steps) with two independent
flow-matching schedulers, maintaining a KV cache across chunks. Real observed keyframes are
fed back into the KV cache as the chunk is executed (closed-loop world modeling).

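The two denoising loops can be illustrated with a scalar toy example. This is a minimal sketch, not the policy's sampler: `cfg_velocity`, `euler_sample` and `toy_velocity` are hypothetical names, the linear sigma grid is an assumption (the real schedulers also apply an SNR shift), and the toy velocity function stands in for the transformer.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional velocity."""
    return v_uncond + scale * (v_cond - v_uncond)


def euler_sample(x, velocity_fn, num_steps):
    """Integrate a flow-matching ODE from sigma=1 (noise) to sigma=0 (data)."""
    # Linear sigma grid (assumption; the real schedulers also shift the SNR).
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x


def toy_velocity(target):
    """Toy stand-in for the transformer: the exact velocity towards `target`."""
    return lambda x, sigma: (x - target) / sigma


cond, uncond = toy_velocity(2.0), toy_velocity(1.0)

# Video stream: 20 steps, guidance scale 5 (extrapolates past the conditional target).
video = euler_sample(0.3, lambda x, s: cfg_velocity(cond(x, s), uncond(x, s), 5.0), 20)

# Action stream: 50 steps, guidance scale 1 (reduces to the conditional prediction).
action = euler_sample(0.3, lambda x, s: cfg_velocity(cond(x, s), uncond(x, s), 1.0), 50)

print(round(video, 6), round(action, 6))
```

With a guidance scale of 1 the guided velocity equals the conditional one, which is why the action stream in this sketch lands on the conditional target.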
### What the LeRobot Integration Covers

- Standard `policy.type=lingbot_va` configuration through LeRobot.
- Ready-to-use LeRobot-format checkpoints on the Hub (converted from the released upstream ones).
- Autoregressive dual-stream inference behind the standard `select_action` interface
  (single-environment eval, `--eval.batch_size=1`).
- Opt-in saving of the policy's **predicted (imagined) videos** during eval / training.
- Evaluation with `lerobot-eval` on LIBERO and RoboTwin.
- Training / fine-tuning via the dual-stream flow-matching loss (`policy.forward`), see below.

## Installation

1. Install LeRobot by following the [Installation Guide](./installation).
2. Install the LingBot-VA extra:

```bash
pip install -e ".[lingbot_va]"
```

## Checkpoints

The released upstream checkpoints have been converted to LeRobot format and pushed to the Hub:

| Variant                | LeRobot checkpoint               |
| ---------------------- | -------------------------------- |
| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long` |
| RoboTwin post-train    | `lerobot/lingbot_va_robotwin`    |
| Pretrained base        | `lerobot/lingbot_va_base`        |

Only the trainable ~5B transformer is stored in the LeRobot
`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
`config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
transformer + VAE fit on a single 24–32 GB GPU.

## Evaluation (LIBERO)

```bash
lerobot-eval \
  --policy.path=lerobot/lingbot_va_libero_long \
  --policy.device=cuda \
  --env.type=libero --env.task=libero_10 \
  --env.observation_height=128 --env.observation_width=128 \
  --eval.n_episodes=50 --eval.batch_size=1 \
  --output_dir=outputs/eval/lingbot_va_libero
```

LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
single-environment eval; use `--eval.batch_size=1`.

## Evaluation (RoboTwin)

RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
(`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
deltas (`robotwin_tshape` latent layout) that are composed onto the episode's initial eef pose and
executed via CuRobo IK.

```bash
lerobot-eval \
  --policy.path=lerobot/lingbot_va_robotwin \
  --policy.device=cuda \
  --env.type=robotwin --env.task=beat_block_hammer --env.action_mode=ee \
  --eval.n_episodes=10 --eval.batch_size=1 \
  --output_dir=outputs/eval/lingbot_va_robotwin
```

### Saving predicted (imagined) videos

Set `--policy.save_predicted_video=true` to additionally VAE-decode the predicted video
latents and write `pred_episode_*.mp4` next to the env-rendered `eval_episode_*.mp4` videos.
The same flag works for the periodic eval during `lerobot-train`.

## Training / fine-tuning

`LingBotVAPolicy.forward(batch)` implements the dual-stream **flow-matching** loss
(`latent_loss + action_loss`, timestep-weighted, action-masked) from the paper: it VAE-encodes
the camera clips into video latents, UMT5-encodes the task, noises both streams, runs the
transformer's block-causal training pass and returns `(loss, metrics)`. The optimizer preset is
AdamW with a linear-warmup-then-constant schedule (matching upstream).

Requirements:

- The block-causal masks use PyTorch **flex-attention**, so build the policy with
  `--policy.attn_mode=flex` for training (the default `torch` SDPA is inference-only).
- The full 5B DiT does not fit on a single 24–32 GB GPU under AdamW; fine-tune with **LoRA**
  (`--policy.use_peft=true`) and/or optimizer offload. `get_optim_params` returns only the
  trainable (e.g. adapter) parameters; the VAE + UMT5 text encoder stay frozen.

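A back-of-the-envelope estimate shows why full fine-tuning exceeds a 24–32 GB GPU. This is a rough sketch under stated assumptions: bf16 weights and gradients, fp32 AdamW moments, and a hypothetical 50M-parameter adapter. None of these figures come from the LingBot-VA config, and activations are ignored.

```python
DIT_PARAMS = 5e9  # ~5B-parameter DiT backbone

# Bytes per parameter (all three are assumptions about the training recipe).
BYTES_WEIGHT = 2     # bf16 weights
BYTES_GRAD = 2       # bf16 gradients
BYTES_ADAMW = 4 + 4  # fp32 first and second moments


def training_state_gb(trainable_params, frozen_params=0.0):
    """Weights + gradients + AdamW state in GB, ignoring activations."""
    trainable = trainable_params * (BYTES_WEIGHT + BYTES_GRAD + BYTES_ADAMW)
    frozen = frozen_params * BYTES_WEIGHT  # frozen weights need no grads or optimizer state
    return (trainable + frozen) / 1e9


full_finetune_gb = training_state_gb(DIT_PARAMS)

# LoRA: the backbone is frozen; only a small adapter is trained (size is hypothetical).
ADAPTER_PARAMS = 50e6
lora_gb = training_state_gb(ADAPTER_PARAMS, frozen_params=DIT_PARAMS)

print(f"full fine-tune: ~{full_finetune_gb:.0f} GB, LoRA: ~{lora_gb:.1f} GB")
```

Under these assumptions, gradients and optimizer state dominate the full fine-tuning footprint, and freezing the backbone removes most of it.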
```bash
lerobot-train \
  --policy.path=lerobot/lingbot_va_libero_long --policy.attn_mode=flex \
  --policy.use_peft=true \
  --dataset.repo_id=<your LeRobot-format dataset> \
  --batch_size=1 --steps=... --output_dir=outputs/train/lingbot_va
```

The dataset must provide camera clips (a temporal window per camera, VAE-encoded to
`frame_chunk_size` latent frames) and `frame_chunk_size * action_per_frame` action steps per item.

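The relationship between latent frames and action steps can be checked with a few lines of Python. This is only an illustrative sketch: `group_actions_by_latent_frame` is a hypothetical helper, and the assumption that consecutive action steps belong to the same latent frame is not confirmed by this page. The values 4 / 4 are the LIBERO defaults for `frame_chunk_size` and `action_per_frame`.

```python
def group_actions_by_latent_frame(actions, frame_chunk_size, action_per_frame):
    """Split a flat list of action steps into one group per latent frame."""
    expected = frame_chunk_size * action_per_frame
    if len(actions) != expected:
        raise ValueError(f"expected {expected} action steps, got {len(actions)}")
    return [
        actions[i * action_per_frame : (i + 1) * action_per_frame]
        for i in range(frame_chunk_size)
    ]


# With the LIBERO defaults, one item carries 4 * 4 = 16 action steps.
steps = list(range(16))
groups = group_actions_by_latent_frame(steps, frame_chunk_size=4, action_per_frame=4)
print(groups)
```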
## Data format (action channels & camera order)

LingBot-VA is an **end-effector (Cartesian) pose** policy: it predicts EEF poses + gripper, not
joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
the channels a given checkpoint actually uses):

| channels | meaning                                               |
| -------- | ----------------------------------------------------- |
| 0–6      | Left-arm end-effector pose                            |
| 7–13     | Right-arm end-effector pose                           |
| 14–20    | Left-arm joints (unused by the released checkpoints)  |
| 21–27    | Right-arm joints (unused by the released checkpoints) |
| 28       | Left gripper                                          |
| 29       | Right gripper                                         |

- **LIBERO** uses channels `0–6`: a 6-DoF EEF delta (xyz + rotation) + gripper (single arm).
- **RoboTwin** uses channels `[0–6, 28, 7–13, 29]`: left EEF (xyz + quaternion) + left gripper +
  right EEF + right gripper (16 dims). The env converts these poses to joint trajectories via
  CuRobo IK; joints are never predicted.

Joint-space datasets (or a different EEF convention) must be remapped into this schema before
fine-tuning these checkpoints.

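The channel mapping described above can be sketched as a scatter into the 30-dim layout. This is a minimal illustration, not the policy's own code: `scatter_action`, `gather_action` and the two constant names are hypothetical, and the channel lists are transcribed from the bullets above.

```python
ACTION_DIM = 30

# Channel ids transcribed from the schema above.
LIBERO_CHANNEL_IDS = list(range(0, 7))
ROBOTWIN_CHANNEL_IDS = list(range(0, 7)) + [28] + list(range(7, 14)) + [29]


def scatter_action(action, channel_ids, action_dim=ACTION_DIM):
    """Place a robot-specific action into the shared layout, zero-padding the rest."""
    if len(action) != len(channel_ids):
        raise ValueError(f"expected {len(channel_ids)} values, got {len(action)}")
    full = [0.0] * action_dim
    for value, channel in zip(action, channel_ids):
        full[channel] = float(value)
    return full


def gather_action(full, channel_ids):
    """Inverse of scatter_action: read back only the channels a checkpoint uses."""
    return [full[channel] for channel in channel_ids]


# A 16-d RoboTwin action: left EEF (7), left gripper, right EEF (7), right gripper.
robotwin_action = [float(i) for i in range(1, 17)]
full = scatter_action(robotwin_action, ROBOTWIN_CHANNEL_IDS)

print(full[28], full[7], full[29])  # left gripper, first right-EEF value, right gripper
```

Gathering with the same id list restores the original 16 values, so the mapping is lossless for the channels in use.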
**Camera order is fixed and order-sensitive**: per-camera latents are concatenated spatially in
`obs_cam_keys` order, so the physical camera→slot mapping must match training:

| benchmark | `obs_cam_keys` (in order)                                                                             | `camera_layout`                                                     |
| --------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| LIBERO    | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist)  | `width_concat` (latents concatenated on width)                      |
| RoboTwin  | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |

The first camera is the exterior/head view and the rest are wrist views.

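Why the order matters can be seen in a toy version of the `width_concat` layout. This is a sketch under simplifying assumptions: each camera's latent is a small 2-D grid of labels (the real latents are multi-channel tensors), and the `width_concat` function below is a hypothetical stand-in, not the policy's implementation.

```python
def width_concat(latents_by_cam, obs_cam_keys):
    """Concatenate per-camera latent grids along the width axis, in key order."""
    grids = [latents_by_cam[key] for key in obs_cam_keys]
    height = len(grids[0])
    if any(len(grid) != height for grid in grids):
        raise ValueError("all cameras must share the same latent height")
    return [[value for grid in grids for value in grid[row]] for row in range(height)]


# LIBERO camera keys, in the documented order.
OBS_CAM_KEYS = ["observation.images.image", "observation.images.image2"]

latents = {
    "observation.images.image": [["a00", "a01"], ["a10", "a11"]],   # agentview
    "observation.images.image2": [["w00", "w01"], ["w10", "w11"]],  # wrist
}

canvas = width_concat(latents, OBS_CAM_KEYS)
swapped = width_concat(latents, list(reversed(OBS_CAM_KEYS)))

print(canvas[0])   # agentview columns first, then wrist columns
print(swapped[0])  # a different canvas: the model would see the cameras in the wrong slots
```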
## Inference Hyperparameters (LIBERO)

| Key                                    | Value                                                                             |
| -------------------------------------- | --------------------------------------------------------------------------------- |
| height × width                         | 128 × 128                                                                         |
| cameras                                | `observation.images.image` (agentview), `observation.images.image2` (eye-in-hand) |
| action channels used                   | 0–6 (7-DoF arm + gripper)                                                         |
| action_per_frame / frame_chunk_size    | 4 / 4                                                                             |
| attn_window                            | 30                                                                                |
| video / action denoising steps         | 20 / 50                                                                           |
| guidance_scale / action_guidance_scale | 5 / 1                                                                             |
| snr_shift / action_snr_shift           | 5.0 / 0.05                                                                        |

These are the defaults of `LingBotVAConfig`; override any of them via `--policy.<name>=...`.

## Notes

- **Attention backend:** inference uses the `torch` SDPA backend (always available). The
  `flashattn` and `flex` backends are optional; `flex` is only needed for training.
- **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs
  roughly 18–24 GB of VRAM.

## License

LingBot-VA is released under Apache-2.0. See the
[upstream repository](https://github.com/Robbyant/lingbot-va).