- Implement LingBotVAPolicy.forward(): the dual-stream flow-matching training loss
(latent + action, timestep-weighted, action-masked), ported from upstream train.py.
It VAE-encodes the camera clips, UMT5-encodes the task, noises both streams, and
runs the block-causal flex-attention training pass (forward_train).
- Add training_loss_from_streams() (core loss) and _build_training_streams() (data
prep: action scatter into the 30-d space, multi-frame VAE encode incl. robotwin_tshape).
- get_optim_params returns only trainable transformer params (LoRA/PEFT friendly);
VAE/UMT5 stay frozen. Training needs attn_mode='flex'.
- Add a tiny-config single-training-step test (forward->loss->backward->AdamW) and a
Training/fine-tuning section in the docs.
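
For reviewers, a minimal sketch of the loss shape described above, written in
plain Python rather than tensors. It assumes the linear path
x_t = (1 - t) * x0 + t * noise with velocity target noise - x0, a scalar timestep
weight per stream, and a 0/1 mask over the padded action dims. The function names
are illustrative only and are not the ones used in train.py or in this port.

```python
def flow_matching_pair(x0, noise, t):
    """Noise a clean sample and return (x_t, velocity target).

    Assumed convention: x_t = (1 - t) * x0 + t * noise, target = noise - x0.
    """
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    target = [n - a for a, n in zip(x0, noise)]
    return x_t, target


def masked_weighted_mse(pred, target, mask, weight):
    """Timestep-weighted MSE averaged over the entries where mask == 1."""
    num = sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask))
    den = max(sum(mask), 1)  # avoid dividing by zero when everything is masked
    return weight * num / den


def dual_stream_loss(latent_pred, latent_target, action_pred, action_target,
                     action_mask, w_latent, w_action):
    """Sum of the latent-stream loss and the masked action-stream loss."""
    full_mask = [1] * len(latent_target)
    latent_loss = masked_weighted_mse(latent_pred, latent_target, full_mask, w_latent)
    action_loss = masked_weighted_mse(action_pred, action_target, action_mask, w_action)
    return latent_loss + action_loss


# Toy usage: 2 latent values, 4 action slots of which only the first 2 are real.
x_t, latent_target = flow_matching_pair([1.0, 2.0], [0.0, 0.0], t=0.25)
action_target = [0.5, -0.5, 0.0, 0.0]
action_pred = [0.5, -0.5, 9.0, 9.0]   # errors sit only in the masked-out slots
loss = dual_stream_loss(latent_target, latent_target, action_pred, action_target,
                        action_mask=[1, 1, 0, 0], w_latent=1.0, w_action=2.0)
```

The mask is what keeps the unused slots of the padded action vector from
contributing gradient when a robot uses fewer dims than the shared space.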
Co-authored-by: Cursor <cursoragent@cursor.com>

Make the LingBot-VA port runnable on both LIBERO and RoboTwin and clean up the
package to follow LeRobot conventions.
- Consolidate all vendored Wan2.2 model code (transformer, attention, VAE helpers,
flow-matching scheduler, grid utils, flex-attention) into a single
modeling_lingbot_va.py; remove the separate wan_*/schedulers modules.
- Move the fixed action (un)normalization quantiles out of the config and into the
post-processor (LIBERO 7-DoF + RoboTwin 16-d eef); remove the conversion script in
favour of ready-to-use LeRobot-format checkpoints on the Hub.
- Fixes found via on-sim validation: undo LIBERO's 180-degree image flip
(image_hflip), encode obs as a multi-frame streaming-VAE clip, reset the streaming
VAE cache between episodes, run the transformer in config.dtype, lazy-load frozen
VAE/UMT5 by subfolder with the text encoder on CPU.
- RoboTwin: add an end-effector-pose action mode to RoboTwinEnv (16-d per-arm
xyz+quat+gripper deltas composed onto the initial eef pose, executed via CuRobo IK)
and the robotwin_tshape latent layout (full-res head + half-res wrists via a second
streaming VAE) with the upstream RoboTwin action quantiles + camera mapping.
- Predicted-video saving works for both benchmarks; docs + tests updated.
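
A minimal sketch of what fixed-quantile (un)normalization in the post-processor
could look like. It assumes actions are normalized to [-1, 1] between a low and a
high quantile per dimension. The quantile values and function names below are
made up for illustration and are not the LIBERO or RoboTwin quantiles shipped
with the checkpoints.

```python
def normalize(action, q_low, q_high, eps=1e-8):
    """Map raw actions to [-1, 1] using per-dimension quantiles (assumed scheme)."""
    return [
        2.0 * (a - lo) / max(hi - lo, eps) - 1.0
        for a, lo, hi in zip(action, q_low, q_high)
    ]


def unnormalize(norm_action, q_low, q_high):
    """Inverse of normalize(): map [-1, 1] back to the raw action range."""
    return [
        lo + (a + 1.0) * 0.5 * (hi - lo)
        for a, lo, hi in zip(norm_action, q_low, q_high)
    ]


# Hypothetical 3-d quantiles, chosen only so the arithmetic is easy to follow.
Q_LOW = [0.0, 0.0, 0.0]
Q_HIGH = [2.0, 4.0, 8.0]

raw = unnormalize([-1.0, 0.0, 1.0], Q_LOW, Q_HIGH)
round_trip = unnormalize(normalize([0.5, 1.0, 6.0], Q_LOW, Q_HIGH), Q_LOW, Q_HIGH)
```

Keeping the quantiles in the post-processor rather than the config means the
policy itself only ever sees and emits normalized actions.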
Co-authored-by: Cursor <cursoragent@cursor.com>

Port the LingBot-VA policy (Wan2.2 dual-stream video+action world model) into
LeRobot, following the EO-1 / VLA-JEPA conventions. Covers inference, checkpoint
conversion, and predicted-video saving (training is deferred to a follow-up PR).
- Vendor the Wan transformer/attention/flex/VAE/scheduler modules, preserving key
names so checkpoint conversion is near-identity; attention defaults to torch SDPA,
with flash-attn and flex-attention imported lazily behind guards.
- LingBotVAConfig (registered "lingbot_va") + processor with fixed-quantile action
unnormalization; full dual-stream sampling loop with CFG, two flow-matching
schedulers and KV cache, mapped onto select_action with observed-keyframe feedback.
- convert_lingbot_va_checkpoints.py (libero/robotwin variants): bundles the ~5B
transformer, lazy-pulls the frozen VAE+UMT5 from the source repo.
- Predicted-video plumbing in lerobot_eval (predicted_frames_callback; opt-in via
--policy.save_predicted_video) and ConstantWithWarmupSchedulerConfig.
- pyproject: widen diffusers-dep to <0.37, add lingbot_va + imageio-dep extras,
add lingbot_va and (missing) eo1 to `all`.
- Factory + policies/__init__ wiring, docs page + toctree, and tests.
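
For orientation, a minimal single-stream sketch of classifier-free guidance
inside an Euler flow-matching sampler, in plain Python. It assumes sampling runs
from t=1 (noise) to t=0 (data) with uniform steps and that guidance is applied as
v_uncond + scale * (v_cond - v_uncond). The toy velocity function is a
hypothetical stand-in for the transformer. The actual loop runs two streams with
their own schedulers and a KV cache, none of which is shown here.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, num_steps, scale):
    """Integrate dx/dt = v from t=1 down to t=0 with fixed-size Euler steps."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        v_cond, v_uncond = velocity_fn(x, t)
        v = cfg_velocity(v_cond, v_uncond, scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity_fn(cond_target, uncond_target):
    """Hypothetical model stand-in: exact velocity fields toward two fixed targets."""
    def fn(x, t):
        v_c = [(xi - ci) / t for xi, ci in zip(x, cond_target)]
        v_u = [(xi - ui) / t for xi, ui in zip(x, uncond_target)]
        return v_c, v_u
    return fn


# With scale=1.0 guidance reduces to the conditional prediction alone.
sample = euler_sample(
    [0.3, -1.2],
    toy_velocity_fn(cond_target=[1.0, 2.0], uncond_target=[0.0, 0.0]),
    num_steps=10,
    scale=1.0,
)
```

Scales above 1.0 extrapolate past the conditional prediction, which is the usual
reason each sampling step costs two forward passes.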
Note: the LIBERO success-rate correctness gate must be validated on a CUDA GPU
with the converted checkpoint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>