Isaac-GR00T crops a random crop_fraction window during training and the
deterministic center window at eval, replaying the sampled window across all
camera views of a sample. This contract is unchanged since the N1.5 release
(gr00t/data/transform/video.py: "If mode is 'train', return a random crop
transform. If mode is 'eval', return a center crop transform.") and mirrors
LeRobot's own Diffusion/VQBeT crop_is_random pattern. The LeRobot N1.7 port
used the eval center crop for training too, so the fine-tuned projector/DiT
never sees frame borders and trains on a single fixed appearance point.
Scope: crop geometry ONLY - no color jitter, no new dependencies. The random
window is plain numpy slicing inside the existing cv2 eval transform:
- _transform_n1_7_image_for_vlm_albumentations gains crop_position=(y, x)
fractions; None keeps the center crop byte-identical to before (verified
by test)
- GrootN17VLMEncodeStep gains a runtime-only 'training' flag (never
serialized; reloaded pipelines default to eval); training samples ONE
window per sample and reuses it across (timestep, view) frames - Isaac's
cross-view consistency
- gated on torch.is_grad_enabled() so no_grad validation and frozen-eval
paths are unaffected
- wired via dataset_meta is not None in make_groot_pre_post_processors and
the existing _set_groot_preprocessor_training on serialized reloads
Verification: tests/policies/groot/test_groot_train_random_crop.py (8 passed:
center-crop bit-exactness with crop_position=None, corner/center windows,
cross-view replay, train!=eval, no_grad gating, seed reproducibility,
serialization contract) + groot suite 23 passed / 5 skipped on RTX PRO 6000 /
CUDA 13.3.
Evidence from the LeRobot-vs-OSS checkpoint comparison: the LeRobot/HF 8k
checkpoint's DiT moved only ~19% as far from base as the OSS-trained one
(0.0547 vs 0.285 relative L2) - undertrained because the scheduler decayed over
a hardcoded 10k steps regardless of --steps, on top of beta1/clip mismatches.
- AdamW betas (0.95, 0.999) -> (0.9, 0.999) and grad_clip_norm 10.0 -> 1.0
(Isaac defaults)
- scheduler: hardcoded CosineDecayWithWarmup(10k decay, floor 10% peak) ->
DiffuserSchedulerConfig HF cosine with ceil(max_steps * warmup_ratio) warmup,
deriving num_training_steps from the outer --steps at runtime
- model_params_fp32 (default true): keep master weights in FP32 and compute
under BF16 autocast like the native N1.7 recipe (fixes optimizer-update
numerics vs pure-BF16 params)
- weight-decay grouping via transformers get_parameter_names: biases and norm
parameters excluded from decay
- restore the TF4 lm_head/embedding weight tie so the unused Qwen LM head stays
frozen and deduplicated in checkpoints
- action_mask kept in native dtype for the masked flow-matching loss
- drop_n_last_frames: exclude episode tails that cannot supply a complete
action chunk (Isaac sampler behavior)
Verification: tests/policies/groot/test_groot_training_optim_contract.py
(7 passed) + remaining groot suite 11 passed/5 skipped on RTX PRO 6000 /
CUDA 13.3. Note: tests/policies/groot/test_groot_n1_7.py does not collect on
the base branch (pre-existing ImportError, fixed in PR #37).
Isaac-GR00T applies dual state regularization during fine-tuning: raw-state
zeroing driven by the processor sidecar's state_dropout_prob (0.2 for the
inspected N1.7 checkpoint) plus encoded-feature dropout. Baseline LeRobot kept
the processor in deterministic mode, so the raw-state dropout never activated
(RCA Tier-2 contributor to the LeRobot-trained SO-101 failures).
- GrootN17PackInputsStep: runtime-only 'training' flag + state_dropout_prob;
whole-sample state zeroing gated on torch.is_grad_enabled() so eval and
no_grad validation paths are unaffected
- sidecar loader reads state_dropout_prob from processor_config.json
- state_dropout_prob serializes with the step; the training flag intentionally
does not (reloaded pipelines default to eval, re-enabled only when processors
are rebuilt with dataset_meta)
- _set_groot_preprocessor_training toggles any dataclass step exposing a
'training' field on serialized-pipeline reloads
Verification: tests/policies/groot/test_groot_state_dropout.py (4 passed) on
RTX PRO 6000 / CUDA 13.3.
GR00T training was dataloader-bound (0->100->0 GPU-utilization sawtooth).
GrootN17VLMEncodeStep ran the Qwen3-VL image processor per frame on PIL images
on the single CPU main-loop thread, and that cost is timed inside dataloading_s
(preprocessor(batch) runs in the main process, not the dataloader workers), so
adding workers cannot hide it.
- Feed the torchvision-backed Qwen3-VL processor (C,H,W) uint8 tensors instead
of a per-frame Image.fromarray PIL roundtrip, and run resize/normalize/patchify
on config.device (GPU) when available. Bit-identical on CPU when no resize is
configured; with a resize only the PIL->torchvision bicubic backend differs
(<2/255 per pixel). The use_albumentations path stays PIL/cv2; reload on a box
without the saved device falls back to CPU.
- Default image_target_size/crop to the N1.7 backbone's training geometry
(256x256 / 230x230) when a checkpoint ships no image sizing (checkpoint_assets
is None, e.g. finetuning nvidia/GR00T-N1.7-3B via repo-id with a new
embodiment). Previously image_target_size=None disabled the resize, so
full-resolution frames were patchified into ~4.7x more vision tokens than the
model was trained on -- inflating dataloading_s (patchify) and update_s (VLM
sequence) and skewing the input distribution. Checkpoints that pin their own
sizing are honored; the default constants are shared with GR00T_N1_7_DEFAULTS.
Net: preprocessing leaves the CPU critical path and the VLM sees the resolution
it was trained on -- faster training/inference and a correct train/serve
distribution. Affects inference too (shared preprocessor); existing checkpoints
still load (backward compatible) but must be retrained to gain the benefits.
The test loaded MODEL_PATH='aractingi/bimanual-handover-groot-10k', an N1.5
checkpoint (config base_model_path=nvidia/GR00T-N1.5-3B, no model_version). On
load, model_version defaults to n1.7 while the base path infers n1.5, so the
version-consistency guard in GrootConfig.__post_init__ raised ValueError and both
test_lerobot_groot_inference and test_lerobot_groot_forward_pass failed. N1.5 is no
longer a supported model_version.
Adopt the test for N1.7:
- MODEL_PATH -> nvidia/GR00T-N1.7-3B (root-level sharded safetensors; loads via
GrootPolicy.from_pretrained as a base N1.7 model).
- Embodiment tag 'gr1' (N1.5) -> 'gr1_unified' (valid N1.7 tag from the checkpoint
embodiment_id.json), via a single EMBODIMENT_TAG constant.
- DUMMY_ACTION_HORIZON 16 -> 40 to match N1.7's native action-chunk size.
- Docstrings/labels updated to 'GR00T N1.7'.
Both tests run and pass on CUDA; full tests/policies/groot/ suite is
73 passed / 0 failed / 0 skipped.
Mirror the tests/policies/pi0_pi05/utils convention: move dump_original_n1_7.py into
a tests/policies/groot/utils/ package (with __init__.py) and update all path
references in the test docstring/skip-message and the policy README.
- Rename test_groot_n1_7_vs_original.py -> test_groot_vs_original.py
- Make the test self-contained: producer script (dump_original_n1_7.py) now lives
next to the test; default artifact dir is repo-relative
(tests/policies/groot/artifacts/), overridable via GROOT_N1_7_PARITY_DIR. The
test only reads artifacts and skips if absent -- it never creates external dirs.
- Heavy .npz artifacts (~6-9MB each) are gitignored and regenerated by the producer;
never committed.
- Drop the verbose 'MULTIPLE EMBODIMENTS' docstring block (kept a one-line note).
- Document the parity procedure in the groot policy README (docs/source/policy_groot_README.md).
- Rename test fn test_groot_n1_7_get_action_parity -> test_groot_get_action_parity.
9/9 embodiments still pass (max|diff| < 3e-6, fp32 eps).
Generalize the original-vs-LeRobot N1.7 output-parity test from a single
libero_sim case to every embodiment tag in the checkpoint (libero_sim, oxe_droid,
real_g1, the real_r1_pro_sharpa family, and the xdof family). Inputs are built
generically from checkpoint metadata; the test discovers per-tag .npz artifacts
and runs one parametrized case each, loading the LeRobot model once via a fixture.
All 9 embodiments match the original to fp32 epsilon (max|diff| < 3e-6), confirming
the integration is correct across the model's full embodiment space and not overfit
to libero_sim.
Verifies the LeRobot GR00T N1.7 integration produces equivalent raw
action_pred to NVIDIA Isaac-GR00T for the same checkpoint, inputs, seed,
precision (fp32) and attention kernel (SDPA): max|diff|=8.9e-7 on the
libero_sim embodiment (GR00T-N1.7-LIBERO/libero_10).
The two impls pin incompatible transformers majors (orig 4.57.3 vs
LeRobot 5.x) and cannot share a process, so the original outputs + exact
collated inputs are produced out-of-process and loaded from an .npz. The
test skips on CI / when the checkpoint or artifact are absent.
* feat(depth): add depth quantization helpers and tests
* feat(video): add ffv1 to supported codecs
* feat(depth): persist depth metadata
* feat(depth): extend quantization tools to better fit the encoding/decoding pipeline
* feat(depth): plumb DepthEncoderConfig through LeRobotDataset and DatasetWriter
* feat(depth): wire StreamingVideoEncoder + writer to depth encoder
* feat(depth): wire DatasetReader to decode_depth_frames
* feat(cameras/realsense): expose async depth in metric meters
* feat(features): route 2D camera shapes to observation.depth.<key>
* feat(robots/so_follower): emit + populate depth keys when use_depth
* feat(record): plumb DepthEncoderConfig through lerobot-record
* feat(viz): render depth observations as rr.DepthImage in Viridis
* feat(depth maps writer): adding support for raw depth maps recording with image writer
* chore(format): format code
* feat(depth shape): ensuring depth maps shape is always including the channel
* feat(is_depth): simplifying is_depth nested name + legacy support
* fix(stop_event): fixing stop_event race condition in camera classes
* fix(plumbing): fixing missing parts in the depth maps pipeline
* chore(typos): fixing typos
* test(fix): fixing exisiting tests to still work with latest features
* tests(depth): adding new tests for depth integration validation
* feat(pix_fmt channels): use PyAv to check get pixel formats number of channels
* feat(refactor): refactor DepthEncoderConfig quantization pipeline, so that the methods do not live in the config class. Add pixel format - channels validation.Move the default pixel format for depth in the config file.
* fix(pre-commit): fixing mutable defautl value
* fix(info): fixing info metadata update when is_depth_map was set
* tests(typos): fixing typos in tests
* fix(realsense): fixing typo in realsense serial number
* fix(normalization): restricting 255 normalization to non depth/uint8 images only
* fix(typo): fixing typo
* fix(TIFF): add missing quantization and cleanup for TIFF files
* feat(batched dequantization): optimizing dequantize_depth for torch based batched dequantization
* feat(tools): adding depth support in LeRobotDataset edition tools
* test(aggregate): extending aggregation tests to depth frames
* test(cleaning): cleaning up tests
* fix(from_video_info): fixing early validation issue in from_video_info
* fix(typo): fixing typo
* fix(is_depth): adding missing doctrings and is_depth arguments in video decoding functions
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* fix(depth units): fixing depth units output for the realsense cameras
* feat(output unit): adding support for output unit specification at dataset reading/training time
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* test(depth): cleaning up depth tests
* test(depth encoding): updating and cleaning video/depth encoding tests
* chore(format): formatting code
* docs(depth): improving depth maps docs
* test(fix): fixing depth tests
* test(dataset tools): adding missing tests for new dataset edition tools features
* chore(format): formatting code
* fix(pyav check): fixing PyAV option validation for integer codec options by normalizing
numeric values before calling `is_integer()`
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* docs(mermaid): fixing mermaid diagram
* fix(rebase): rebase follow up corrections
* feat(dataset tools): adding missing docstrings and features for depth fill support in dataset edition tools
* docs(docstring): updating docstrings
* docs(dataset tools): updating docs
* fix(save images): fixing image saving in dataset tools
* fix(update video info): fixing update video info logic to match the recording and editing use cases
* test(reencode): fixing reencoding monkeypatch
* fix(review): add Claude review
* chore(format): format code
* fix(update video info): ditching the differentiated approahces for video info update - video info are always updated unless for preserved keys.
* chore(rebase): fixing rebase merge conflicts
* test(visualization): fixing visualization tests
* feat(docstrings): adding explicit docstring for encoding parameters. Docstrigns will now show up as description in the CLI --help.
* feat(mm as default): adding a global DEFAULT_DEPTH_UNIT variable setting mm as default depth unit
* fix(RGB <-> camera): renaming camera_encoder to rgb_encoder for clarity
* chore(TODO): removing deprecated TODO
* doc(write_u16_plane): improving docstrings for write_u16_plane
* feat(units): adding constants for depth frames units (m and mm)
* fix(spam): replacing spamming warning but a debug log
* feat(leagcy metadata): adding automatic metadata update for legacy 'video.is_depth_map' feature
* fix(copy&reindex): fixing metadat reshaping for single channel frames
* fix(ImageNet): excluding dpeth frames from ImageNet stats
* fix(PyAV container seek): fixing initial PyAV container seek to be robust againsy codec choice
* feat(lerobot-dataset-viz): adding support for depth in lerobot-dataset-viz
* fix(compress): removing rerun compression for DepthImages
* fix(signle channel squeeze): fixing single channel squeezing
* chore(format): format code
* fix(streaming): adding support for dequantization in streaming_dataset.py
* refactor(read depth): factorizing depth reading methods for realsense camera and adding support for depth-only usage
* chore(renaming): fixing missed RGBEncoderConfig renamings
* docs(renaming): reflecting renamings in a clearer way in the docs
* chore(annotation): excluding depth from the annotation pipeline
* feat(robots): adding depth support in compatible follower robots
* feat(LeSadKiwi): excluding LeKiwi from depth support (for now)
* chore(fail): removing misplaced file
* chore(fail): removing misplaced file
* fix(remove ffv1): removing ffv1 as it does not support MP4
* docs(cheat sheet): adding depth and video encoding to the cheat sheet
* fix(lossless): tuning depth encoding parameters for lossless depth storage
* test(fix): fixing failing tests
* depth(ZMQ): excluding ZMQ from depth support
* Revert "depth(ZMQ): excluding ZMQ from depth support"
This reverts commit b95cf4e4c2.
* fix(image transforms): excluding depth frames from images transforms
* fix(typo): typo
* fix(stats): fixing stats computation for depth frames
* fix(TIFF vs. pytorch): adding an extra uint16 to float32 conversion for depth maps stored as raw TIFF images
* fix(typos): fixing typos
* test(dtype): fixing stats computation typing tests
---------
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi Ai <wsai@stanford.edu>
* refactor(training): rename eval_freq to env_eval_freq
- Rename eval_freq to env_eval_freq to distinguish sim environment evaluation from offline loss evaluation.
* feat(training): add inline offline validation with train/eval split
- Add eval_split config for balanced per-task holdout
- Add eval_steps for periodic inline eval loss computation
- Add max_eval_samples to cap eval cost
* fix(datasets): remap absolute indices in __getitem__ for filtered datasets
* fix(train): vectorize eval subset selection for max_eval_samples
* fix(datasets): Move the remapping into EpisodeAwareSampler via absolute_to_relative_idx
* fix(validation): add eval_split range check and eval_steps warning
Validate eval_split is in [0.0, 1.0) to prevent garbage splits from
out-of-range values. Raise when eval_steps > 0 but eval_split is 0.0
since no offline eval will run.
* fix(train): prepare eval dataloader with accelerator for multi-GPU
Prepare eval_dataloader through accelerator.prepare() so eval data is
sharded across ranks instead of duplicated. Reduce eval_loss across
ranks with mean reduction for consistent logging.
* fix(test): rename eval_freq to env_eval_freq for multi-GPU training
Align the MolmoAct2 implementation with lerobot codebase conventions:
- Rename hf_model/ to molmoact2_hf_model/
- Slim config: move all I/O and runtime logic to modeling
- Remove blanket from 8 vendored files, fix 66 lint issues
- Deduplicate _hf_token() and _resolve_checkpoint_location()
- Make huggingface_hub imports lazy
- Remove custom MolmoAct2CosineDecayWithWarmupSchedulerConfig, use base class
- Extract 13 static/classmethods from MolmoAct2Policy to free functions
- Replace print() with logger in vendored action_tokenizer
- Add module docstrings, class docstring, and key method docstrings
- Add module-level loggers to modeling and processor
- Fix docs: pip to uv install, deduplicate README symlink
- Remove shebangs from all files
* feat(train): FSDP checkpoint saving
* adding docs for FSDP
* adding a test for the fsdp checkpoint path
* cleanup
* fixing final upload to hub
* refactored initial implementation to use torch fsdp api and adding new tests
* fix(datasets): enforce one parquet row group per episode in v3 data writes
LeRobot v3 data shards must hold exactly one row group per episode so a
reader can fetch episode i with pq.ParquetFile(path).read_row_group(i)
(a byte-range read) instead of loading the whole shard. The recording
writer already does this (one write_table per episode); the aggregate
and lerobot-annotate re-write paths instead concatenated many episodes
and wrote them in one shot, collapsing the file to a single row group.
- io_utils: add write_table_one_row_group_per_episode (one ParquetWriter,
one write_table per episode — same pattern as the recording writer);
to_parquet_with_hf_images embeds images then writes per-episode row
groups; to_parquet_one_row_group_per_episode wraps it for plain frames
- aggregate: route non-image data writes through the per-episode writer;
leave the episodes-metadata parquet untouched (already one row/episode)
- annotate: rewrite shards via the per-episode writer instead of a single
bulk pq.write_table
- tests: invariant coverage through the aggregate (image + video) and
annotate paths
No change to on-disk schema, paths, naming, rollover thresholds, or
compression. Readers stay backward-compatible (old collapsed files load).
* Update src/lerobot/datasets/io_utils.py
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update src/lerobot/datasets/io_utils.py
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* fix(datasets): correct indentation and add strict= in row-group helper
The web-edited numpy version of write_table_one_row_group_per_episode had an
over-indented line (IndentationError, breaking pre-commit + test collection)
and a zip() without strict=. Fix both; behaviour unchanged.
---------
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
* fix(images/videos): fixing aggregate_pipeline_dataset_features to avoid unwanted images features deletion when videos are not used
* fix(docstrings): improving docstrings
Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>
---------
Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>
* chore(robots): homogenize bi setups
* feat(robots): split openarm mini into single and bi
* refactor(robots): mixin for bi classes
* docs: update docs
* feat(edit-dataset): add `concatenate_videos` opt-out to merge
When merging datasets, source mp4s are concatenated into shards capped at
`video_files_size_in_mb` (default 200 MB). This is great for dataloader
throughput but destroys per-episode (or per-source) video boundaries,
which is undesirable when you want to inspect, ship, or reuse the
individual mp4s.
Add a `concatenate_videos: bool = True` knob plumbed through
`MergeConfig` → `merge_datasets` → `aggregate_datasets` → `aggregate_videos`.
When False, each source mp4 is copied 1:1 to its own destination mp4 with
no re-muxing, so the merge preserves source video boundaries.
Usage:
lerobot-edit-dataset \
--new_repo_id user/merged \
--operation.type=merge \
--operation.repo_ids "['user/a', 'user/b']" \
--operation.concatenate_videos=false
Defaults are unchanged; the dataloader path is unaffected because the
`episodes.parquet` `from_timestamp`/`to_timestamp` index keeps working
regardless of whether each mp4 holds one or many episodes.
* feat(edit-dataset): extend concatenate opt-out to data files
Following review, add a concatenate_data flag mirroring concatenate_videos,
threaded through MergeConfig, merge_datasets, aggregate_datasets, aggregate_data
and append_or_create_parquet_file. Metadata index files still always concatenate.
Also trim the verbose docstrings and comments since the names are
self-explanatory, and extend the existing merge test to cover data files.
Steerable annotation pipeline (lerobot-annotate) that populates the language_persistent and language_events columns introduced in PR 1 (#3467) directly into data/chunk-*/file-*.parquet.
This is PR 2 of the three-PR plan:
PR 1 (Add extensive language support #3467): schema + DSL + rendering, base of this PR
PR 2 (this PR): annotation pipeline writing into PR 1's columns
PR 3: model with language prediction and runtime
A VLM (Qwen-VL family, served on vLLM) watches each episode's video and emits grounded language annotations: subtasks, plans, memory, task rephrasings, interjections + speech, and per-camera VQA. The pipeline is built for production annotation at scale — single-camera grounding, embedded-frame inputs, a describe-then-segment grounding flow, and a deterministic full-episode coverage guarantee — informed by Scale's dense-captioning findings (representation > sampling, rules > reasoning, model capacity is the biggest lever, two-pass systems compound errors)