lerobot

admin/lerobot

Fork 0

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 16:57:12 +00:00

Commit Graph

Author	SHA1	Message	Date
pepijn223	71aacda05e	feat(lingbot_va): implement training / fine-tuning (flow-matching loss) - Implement LingBotVAPolicy.forward(): dual-stream flow-matching training loss (latent + action, timestep-weighted, action-masked) ported from upstream train.py; VAE-encodes camera clips, UMT5-encodes the task, noises both streams, runs the block-causal flex-attention training pass (forward_train). - training_loss_from_streams() core + _build_training_streams() data prep (action scatter into the 30-d space, multi-frame VAE encode incl. robotwin_tshape). - get_optim_params returns only trainable transformer params (LoRA/PEFT friendly); VAE/UMT5 stay frozen. Training needs attn_mode='flex'. - Add a tiny-config single-training-step test (forward->loss->backward->AdamW) and a Training/fine-tuning section in the docs. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 15:38:41 +02:00
pepijn223	e3deff00ad	feat(lingbot_va): RoboTwin eef-pose eval, single-file model, Hub checkpoints Make the LingBot-VA port runnable on both LIBERO and RoboTwin and clean up the package to LeRobot conventions. - Consolidate all vendored Wan2.2 model code (transformer, attention, VAE helpers, flow-matching scheduler, grid utils, flex-attention) into a single modeling_lingbot_va.py; remove the separate wan_*/schedulers modules. - Move the fixed action (un)normalization quantiles out of the config and into the post-processor (LIBERO 7-DoF + RoboTwin 16-d eef); remove the conversion script in favour of ready-to-use LeRobot-format checkpoints on the Hub. - Fixes found via on-sim validation: undo LIBERO's 180-degree image flip (image_hflip), encode obs as a multi-frame streaming-VAE clip, reset the streaming VAE cache between episodes, run the transformer in config.dtype, lazy-load frozen VAE/UMT5 by subfolder with the text encoder on CPU. - RoboTwin: add an end-effector-pose action mode to RoboTwinEnv (16-d per-arm xyz+quat+gripper deltas composed onto the initial eef pose, executed via CuRobo IK) and the robotwin_tshape latent layout (full-res head + half-res wrists via a second streaming VAE) with the upstream RoboTwin action quantiles + camera mapping. - Predicted-video saving works for both benchmarks; docs + tests updated. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 15:20:51 +02:00
Pepijn	4dfa8cea65	feat(policies): add LingBot-VA autoregressive video-action world model Port the LingBot-VA policy (Wan2.2 dual-stream video+action world model) into LeRobot, following the EO-1 / VLA-JEPA conventions. Covers inference, checkpoint conversion, and predicted-video saving (training is deferred to a follow-up PR). - Vendored Wan transformer/attention/flex/VAE/scheduler modules (key names preserved for near-identity conversion); torch SDPA default, flashattn/flex lazy-guarded. - LingBotVAConfig (registered "lingbot_va") + processor with fixed-quantile action unnormalization; full dual-stream sampling loop with CFG, two flow-matching schedulers and KV cache, mapped onto select_action with observed-keyframe feedback. - convert_lingbot_va_checkpoints.py (libero/robotwin variants): bundles the ~5B transformer, lazy-pulls the frozen VAE+UMT5 from the source repo. - Predicted-video plumbing in lerobot_eval (predicted_frames_callback; opt-in via --policy.save_predicted_video) and ConstantWithWarmupSchedulerConfig. - pyproject: widen diffusers-dep to <0.37, add lingbot_va + imageio-dep extras, add lingbot_va and (missing) eo1 to `all`. - Factory + policies/__init__ wiring, docs page + toctree, and tests. Note: the LIBERO success-rate correctness gate must be validated on a CUDA GPU with the converted checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 16:28:19 +02:00

Author

SHA1

Message

Date

pepijn223

71aacda05e

feat(lingbot_va): implement training / fine-tuning (flow-matching loss)

- Implement LingBotVAPolicy.forward(): dual-stream flow-matching training loss
  (latent + action, timestep-weighted, action-masked) ported from upstream train.py;
  VAE-encodes camera clips, UMT5-encodes the task, noises both streams, runs the
  block-causal flex-attention training pass (forward_train).
- training_loss_from_streams() core + _build_training_streams() data prep (action
  scatter into the 30-d space, multi-frame VAE encode incl. robotwin_tshape).
- get_optim_params returns only trainable transformer params (LoRA/PEFT friendly);
  VAE/UMT5 stay frozen. Training needs attn_mode='flex'.
- Add a tiny-config single-training-step test (forward->loss->backward->AdamW) and a
  Training/fine-tuning section in the docs.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-06 15:38:41 +02:00

pepijn223

e3deff00ad

feat(lingbot_va): RoboTwin eef-pose eval, single-file model, Hub checkpoints

Make the LingBot-VA port runnable on both LIBERO and RoboTwin and clean up the
package to LeRobot conventions.

- Consolidate all vendored Wan2.2 model code (transformer, attention, VAE helpers,
  flow-matching scheduler, grid utils, flex-attention) into a single
  modeling_lingbot_va.py; remove the separate wan_*/schedulers modules.
- Move the fixed action (un)normalization quantiles out of the config and into the
  post-processor (LIBERO 7-DoF + RoboTwin 16-d eef); remove the conversion script in
  favour of ready-to-use LeRobot-format checkpoints on the Hub.
- Fixes found via on-sim validation: undo LIBERO's 180-degree image flip
  (image_hflip), encode obs as a multi-frame streaming-VAE clip, reset the streaming
  VAE cache between episodes, run the transformer in config.dtype, lazy-load frozen
  VAE/UMT5 by subfolder with the text encoder on CPU.
- RoboTwin: add an end-effector-pose action mode to RoboTwinEnv (16-d per-arm
  xyz+quat+gripper deltas composed onto the initial eef pose, executed via CuRobo IK)
  and the robotwin_tshape latent layout (full-res head + half-res wrists via a second
  streaming VAE) with the upstream RoboTwin action quantiles + camera mapping.
- Predicted-video saving works for both benchmarks; docs + tests updated.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-06 15:20:51 +02:00

Pepijn

4dfa8cea65

feat(policies): add LingBot-VA autoregressive video-action world model

Port the LingBot-VA policy (Wan2.2 dual-stream video+action world model) into
LeRobot, following the EO-1 / VLA-JEPA conventions. Covers inference, checkpoint
conversion, and predicted-video saving (training is deferred to a follow-up PR).

- Vendored Wan transformer/attention/flex/VAE/scheduler modules (key names preserved
  for near-identity conversion); torch SDPA default, flashattn/flex lazy-guarded.
- LingBotVAConfig (registered "lingbot_va") + processor with fixed-quantile action
  unnormalization; full dual-stream sampling loop with CFG, two flow-matching
  schedulers and KV cache, mapped onto select_action with observed-keyframe feedback.
- convert_lingbot_va_checkpoints.py (libero/robotwin variants): bundles the ~5B
  transformer, lazy-pulls the frozen VAE+UMT5 from the source repo.
- Predicted-video plumbing in lerobot_eval (predicted_frames_callback; opt-in via
  --policy.save_predicted_video) and ConstantWithWarmupSchedulerConfig.
- pyproject: widen diffusers-dep to <0.37, add lingbot_va + imageio-dep extras,
  add lingbot_va and (missing) eo1 to `all`.
- Factory + policies/__init__ wiring, docs page + toctree, and tests.

Note: the LIBERO success-rate correctness gate must be validated on a CUDA GPU
with the converted checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 16:28:19 +02:00

3 Commits