feat(lingbot_va): implement training / fine-tuning (flow-matching loss)

- Implement LingBotVAPolicy.forward(): dual-stream flow-matching training loss
  (latent + action, timestep-weighted, action-masked) ported from upstream
  train.py; VAE-encodes camera clips, UMT5-encodes the task, noises both
  streams, runs the block-causal flex-attention training pass (forward_train).
- training_loss_from_streams() core + _build_training_streams() data prep
  (action scatter into the 30-d space, multi-frame VAE encode incl.
  robotwin_tshape).
- get_optim_params returns only trainable transformer params (LoRA/PEFT
  friendly); VAE/UMT5 stay frozen. Training needs attn_mode='flex'.
- Add a tiny-config single-training-step test
  (forward->loss->backward->AdamW) and a Training/fine-tuning section in the
  docs.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-18 00:37:10 +00:00 · 2026-06-06 15:38:41 +02:00
parent e3deff00ad
commit 71aacda05e
3 changed files with 305 additions and 16 deletions
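
For readers unfamiliar with the loss named in the commit message, the following is a minimal, framework-free sketch of a timestep-weighted, action-masked dual-stream flow-matching loss. It is not the code from this commit. The linear interpolation path, the `noise - clean` velocity target, the scalar `weight`, and all function names are assumptions chosen for illustration; the actual implementation works on tensors inside `LingBotVAPolicy.forward()` and may use different conventions.

```python
def interpolate(clean, noise, t):
    """Assumed linear path: x_t = (1 - t) * clean + t * noise, with t in [0, 1]."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def velocity_target(clean, noise):
    """Assumed regression target: the constant velocity along the linear path."""
    return [n - c for c, n in zip(clean, noise)]


def masked_mse(pred, target, mask=None):
    """Mean squared error over the entries whose mask value is non-zero."""
    if mask is None:
        mask = [1.0] * len(pred)
    total = sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count > 0 else 0.0


def dual_stream_loss(latent_pred, latent_target,
                     action_pred, action_target, action_mask, weight=1.0):
    """Sum of the two per-stream losses, each scaled by a timestep weight.

    Only the action stream is masked, so padded action dimensions
    (mask == 0) do not contribute to the loss.
    """
    latent_loss = weight * masked_mse(latent_pred, latent_target)
    action_loss = weight * masked_mse(action_pred, action_target, action_mask)
    metrics = {"latent_loss": latent_loss, "action_loss": action_loss}
    return latent_loss + action_loss, metrics


# Toy usage: the second action dimension is masked out and is ignored.
loss, metrics = dual_stream_loss(
    latent_pred=[0.0, 0.0], latent_target=[1.0, 1.0],
    action_pred=[2.0, 5.0], action_target=[0.0, 0.0],
    action_mask=[1.0, 0.0], weight=0.5,
)
```

Returning a `(loss, metrics)` pair mirrors the return shape the docs below describe for `policy.forward`; everything else in the sketch is illustrative.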
@@ -32,10 +32,8 @@ fed back into the KV cache as the chunk is executed (closed-loop world modeling)
 - Autoregressive dual-stream inference behind the standard `select_action` interface
  (single-environment eval, `--eval.batch_size=1`).
 - Opt-in saving of the policy's **predicted (imagined) videos** during eval / training.
- Evaluation with `lerobot-eval` on the LIBERO benchmark.
-
-Training (the flow-matching dual-stream loss + latent dataset) is part of a follow-up
-training port and is not yet wired into `lerobot-train`.
+- Evaluation with `lerobot-eval` on LIBERO and RoboTwin.
+- Training / fine-tuning via the dual-stream flow-matching loss (`policy.forward`), see below.

 ## Installation

@@ -105,6 +103,32 @@ Set `--policy.save_predicted_video=true` to additionally VAE-decode the predicte
 latents and write `pred_episode_*.mp4` next to the env-rendered `eval_episode_*.mp4` videos.
 The same flag works for the periodic eval during `lerobot-train`.

+## Training / fine-tuning
+
+`LingBotVAPolicy.forward(batch)` implements the dual-stream **flow-matching** loss
+(`latent_loss + action_loss`, timestep-weighted, action-masked) from the paper: it VAE-encodes
+the camera clips into video latents, UMT5-encodes the task, noises both streams, runs the
+transformer's block-causal training pass, and returns `(loss, metrics)`. The optimizer preset is
+AdamW with a linear-warmup-then-constant schedule (matching upstream).
+
+Requirements:
+- The block-causal masks use PyTorch **flex-attention**, so build the policy with
+  `--policy.attn_mode=flex` for training (the default `torch` SDPA is inference-only).
+- The full 5B DiT does not fit on a single 24–32 GB GPU under AdamW; fine-tune with **LoRA**
+  (`--policy.use_peft=true`) and/or optimizer offload. `get_optim_params` returns only the
+  trainable (e.g. adapter) parameters; the VAE + UMT5 text encoder stay frozen.
+
+```bash
+lerobot-train \
+  --policy.path=pepijn223/lingbot_va_libero_long --policy.attn_mode=flex \
+  --policy.use_peft=true \
+  --dataset.repo_id=<your LeRobot-format dataset> \
+  --batch_size=1 --steps=... --output_dir=outputs/train/lingbot_va
+```
+
+The dataset must provide camera clips (a temporal window per camera, VAE-encoded to
+`frame_chunk_size` latent frames) and `frame_chunk_size * action_per_frame` action steps per item.
+
 ## Inference Hyperparameters (LIBERO)

 | Key | Value |