Compare commits

..

45 Commits

Author SHA1 Message Date
CarolinePascal 4445849b86 feat(depth maps writer): adding support for raw depth maps recording with image writer 2026-05-01 00:49:09 +02:00
CarolinePascal f43bf75f9b fix(viz): anchor rerun DepthImage colormap to encoder depth range 2026-05-01 00:49:09 +02:00
CarolinePascal b540fa94a9 feat(viz): render depth observations as rr.DepthImage in Viridis
log_rerun_data now accepts an optional `features` dict and uses the
`video.is_depth_map=True` info marker to detect depth observations.
Matching 2D arrays are logged as `rr.DepthImage(arr, meter=1.0,
colormap=rr.components.Colormap.Viridis)` and are never JPEG-compressed
(compression is lossy on float32 metric depth).

Detection covers both the namespaced dataset key
(e.g. `observation.depth.front`) and the raw observation keys the robot
emits (`front`, `front_depth`), so it works for both the typed
LeRobotDataset.features dict and the plain robot observation flow.

When `features` is None the previous behaviour is preserved (depth
arrays fall back to the generic `rr.Image` path), so non-depth
recordings and existing call sites are unaffected.

lerobot-record now forwards `dataset.features` so depth keys are picked
up automatically when `--display_data=true`.

Made-with: Cursor
2026-05-01 00:49:09 +02:00
CarolinePascal efad15f600 feat(record): plumb DepthEncoderConfig through lerobot-record
Surface DepthEncoderConfig and depth_encoder_defaults from
lerobot.datasets, and wire dataset.depth_encoder_config through
LeRobotDataset.create() and LeRobotDataset.resume() so depth-capable
recordings (e.g. RealSense use_depth=True) can be tuned from the CLI:

    --dataset.depth_encoder_config.depth_min=0.1
    --dataset.depth_encoder_config.depth_max=4.0
    --dataset.depth_encoder_config.vcodec=ffv1

The default factory keeps depth-stream defaults (12-bit HEVC, log
quantization), so non-depth recordings are unaffected.

Made-with: Cursor
2026-05-01 00:49:09 +02:00
CarolinePascal 407d1882a2 feat(robots/so_follower): emit + populate depth keys when use_depth
When an SO follower has a camera configured with use_depth=True (e.g.
a RealSense), the robot now exposes a paired depth feature so the
dataset records both modalities:

- _cameras_ft adds a 2D "<cam>_depth" entry alongside the 3-channel
  color shape; hw_to_dataset_features turns this into
  observation.depth.<cam> with the depth-map marker.
- get_observation reads cam.read_latest_depth() (float32 metric
  meters from the RealSense async depth API) into <cam>_depth so
  build_dataset_frame can route it.

Detection is duck-typed via getattr(..., "use_depth", False) so other
cameras without that attribute keep their RGB-only behaviour unchanged.

Made-with: Cursor
2026-05-01 00:49:09 +02:00
CarolinePascal 0d6e4f3bad feat(features): route 2D camera shapes to observation.depth.<key>
hw_to_dataset_features now treats a camera entry whose shape has
length 2 as a single-channel depth feature: it emits the feature as
"{prefix}.depth.<bare>" with names=["height", "width"] and an
info={"video.is_depth_map": True} marker so the depth-encoder branch
in LeRobotDataset is engaged. The "_depth" hardware-side suffix (if
present) is stripped so a paired RGB + depth camera ends up as
"observation.images.<cam>" + "observation.depth.<cam>".

build_dataset_frame mirrors the routing: depth feature keys read
their value from "<bare>_depth" in the raw observation dict, with
fallback to the bare name for producers that already emit
dataset-style keys.

Tests: add tests/utils/test_feature_utils.py covering the routing
of 2D vs 3D camera shapes, the paired RGB+depth case, and the
build_dataset_frame value routing.

Made-with: Cursor
2026-05-01 00:49:09 +02:00
CarolinePascal 536b29d963 feat(cameras/realsense): expose async depth in metric meters 2026-05-01 00:48:40 +02:00
CarolinePascal 2744e26593 feat(depth): wire DatasetReader to decode_depth_frames 2026-05-01 00:41:38 +02:00
CarolinePascal de64ad3f7e feat(depth): wire StreamingVideoEncoder + writer to depth encoder 2026-05-01 00:29:34 +02:00
CarolinePascal d777359662 feat(depth): plumb DepthEncoderConfig through LeRobotDataset and DatasetWriter 2026-04-30 23:55:28 +02:00
CarolinePascal 5d0a20bd9c feat(video): alias "av1" to "libsvtav1" for backward compat 2026-04-30 23:43:02 +02:00
CarolinePascal 2c796d3352 feat(depth): persist depth metadata + add reader helpers 2026-04-30 23:38:56 +02:00
CarolinePascal df1648c102 feat(video): add ffv1 to supported codecs 2026-04-30 17:32:50 +02:00
CarolinePascal 3bd96a4346 feat(depth): add depth quantization helpers and tests 2026-04-30 17:31:03 +02:00
CarolinePascal 016799dfa1 chore(format): formatting code 2026-04-30 14:42:37 +02:00
CarolinePascal 51b9038458 chore(PyAV): cleaning up PyAV utils and encoding parameters checks to stick to the minimun required tooling. 2026-04-30 14:31:08 +02:00
CarolinePascal cc9a2e5c99 chore(format): fixing formatting issues 2026-04-29 16:48:57 +02:00
CarolinePascal a2376389f9 test(new): adding new tests for encoding related features 2026-04-29 16:48:56 +02:00
CarolinePascal 57a619ab02 test(existing): adapting existing tests 2026-04-29 16:48:56 +02:00
CarolinePascal 7f624adcc5 chore(duplicate): removing duplicate get_codec_options definition 2026-04-29 16:48:56 +02:00
CarolinePascal 375cf1fdf3 feat(pyav checks): making pyav parameters checks more robust 2026-04-29 16:48:56 +02:00
CarolinePascal b2c2bb7641 feat(VideoEncoderConfig init): making VideoEncoderConfig more robust and adaptable to multiple backends 2026-04-29 16:48:56 +02:00
CarolinePascal 4a87ee1537 fix(concatenation compatibility): adding compatibility check when concatenating video files 2026-04-29 16:48:56 +02:00
CarolinePascal e44f86e516 feat(metadata): adding encoding parameters in dataset metadata 2026-04-29 16:48:56 +02:00
CarolinePascal a0e3acdb67 chore(docs): updating the docs 2026-04-29 16:46:16 +02:00
CarolinePascal 38ff579bcc feat(VideoEncoderConfig): propagating the VideoEncoderConfig in the codebase 2026-04-29 16:44:47 +02:00
CarolinePascal 479e444517 feat(VideoEncoderConfig): creating a VideoEncoderConfig to encapsulate encoding parameters 2026-04-29 16:42:14 +02:00
CarolinePascal 9787b8fa26 feat(pyav utils): adding suport for PyAV encoding parameters validation 2026-04-29 16:42:14 +02:00
CarolinePascal 71f39f6912 chore(video backend): renaming codec into video_backend in get_safe_default_video_backend() 2026-04-29 16:42:14 +02:00
Khalil Meftah b5f65e5332 Expose sarm package API and ship reward model card template (#3477)
* chore: List lerobot_rewardmodel_modelcard_template.md in MANIFEST.in

* chore: export SARMConfig, SARMRewardModel, and make_sarm_pre_post_processors from rewards.sarm.
2026-04-29 16:17:16 +02:00
Khalil Meftah cd6b43ea7a fix(train): migrate legacy RA-BC fields in train config loading (#3480) 2026-04-29 16:17:00 +02:00
Steven Palma 2236bbe7a3 fix(rollout): propagate policy-specific CLI config paramaters (#3483)
Co-authored-by: Maxime Ellerbach <maxime.ellerbach@huggingface.co>
2026-04-29 16:13:10 +02:00
Maxime Ellerbach cb0a944941 refactor(datasets): replace untyped dict with typed DatasetInfo dataclass (#3472)
* refactor(datasets): replace untyped dict with typed DatasetInfo dataclass

Introduce typed DatasetInfo dataclass to replace untyped dict representation of info.json.

Changes:
- Add DatasetInfo dataclass with explicit fields and validation
- Implement __post_init__ for shape conversion (list ↔ tuple)
- Add dict-style compatibility layer (__getitem__, __setitem__, .get())
- Add from_dict() and to_dict() for JSON serialization
- Update io_utils to use load_info/write_info with DatasetInfo
- Update dataset utilities and metadata to use attribute access
- Remove aggregate.py dict-style field access
- Add tests fixture support for DatasetInfo

Benefits:
- Type safety with IDE auto-completion
- Validation at construction time
- Explicit schema documentation

* fix pre-commit

* update docstring inside DatasetInfo.from_dict()

* sorts the unknown to have deterministic output

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* refactoring the last few old fieds


* fix crop dataset roi type mismatch


* use consistantly int for data and video_files_size_in_mb

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: jjolla93 <jjolla93@gmail.com>
2026-04-28 18:40:30 +02:00
Khalil Meftah 8a3d64033f Reward models refactor (#3142)
* feat(rewards): add RewardModelConfig and PreTrainedRewardModel base classes

* refactor(rewards): migrate Classifier from policies/sac/reward_model/ to rewards/classifier/

* refactor(rewards): migrate SARM from policies/sarm/ to rewards/sarm/

* refactor(rewards): add rewards/factory.py and remove reward model code from policies/factory.py

* refactor(rewards): update imports and delete old reward model locations

* test(rewards): add reward model tests and update existing test imports

* fix(rewards): restore full Classifier and SARM implementations

* test(rewards): restore missing CUDA and mixed precision classifier processor tests

* refactor(lerobot_train.py): remove rabc specific configuration and replace it with a generic samplerweight class in lerobot_train

* refactor(lerobot_train.py): add missing sampling weight script

* linter + missing files

* add testing for sampl weighter

* revert some useless changes, improve typing

* update docs

* add automatic detection of the progress path

* remove type exp

* improve comment

* fix: move rabc.py to rewards/sarm/ and update import paths

* refactor(imports): update reward model imports to new module structure

* refactor(imports): update reward model imports to reflect new module structure

* refactor(imports): conditionally import pandas based on availability

* feat(configs): add reward_model field to TrainPipelineConfig and Hub fields to RewardModelConfig

* refactor(policies): remove reward model branches from policy factory and __init__

* refactor(rewards): expand __init__ facade and fix SARMConfig __post_init__ crash

* feat(train): route reward model training through rewards/factory instead of policies/factory

* refactor(train): streamline reward model training logic

* fix(rewards): ensure FileNotFoundError is raised for missing config_file

* refactor(train): update __get_path_fields__ to include reward_model for config loading

* refactor(classifier): remove redundant input normalization in predict_reward method

* fix(train): raise ValueError for non-trainable reward models in train function

* refactor(pretrained_rm): add model card template

* refactor(tests): reward models

* refactor(sarm): update reset method and remove unused action prediction methods

* refactor(wandb): differentiate tags for reward model and policy training in cfg_to_group function

* fix(train): raise ValueError for PEFT usage in reward model training

* refactor(rewards): enhance RewardModelConfig with device handling and delta indices properties

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
2026-04-28 17:56:24 +02:00
Steven Palma 03ee50e08f chore(ci): bump docs workflows (#3476) 2026-04-28 15:06:44 +02:00
Steven Palma ca87ccd941 feat(rollout): decouple policy deployment from data recording with new lerobot-rollout CLI (#3413)
* feat(scripts): lerobot-rollout

* fix(rollout) require dataset in dagger + use duration too

* fix(docs): dagger num_episodes

* test(rollout): fix expectations

* fix(rollout): features check

* fix(rollout): device and task propagation + feature pos + warn fps + move rename_map config

* docs(rollout): edit rename_map instructions

* chore(rollout): multiple minor improvements

* chore(rollout): address coments + minor improvements

* fix(rollout): enable default

* fix(tests): default value RTCConfig

* fix(rollout): robot_observation_processor and notify_observation at policy frequency instead of interpolator rate

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(rollout): prevent relativeactions with sync inference engine

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(rollout): rtc reanchor to non normalized state

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(rollout): fixing the episode length to use hwc (#3469)

also reducing default length to 5 minutes

* feat(rollout): go back to initial position is now a config

* fix(rollout): properly propagating video_files_size_in_mb to lerobot_dataset (#3470)

* chore(rollout): note about dagger correction stage

* chore(docs): update comments and docstring

* fix(test): move rtc relative out of rollout module

* fix(rollout): address the review comments

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Maxime Ellerbach <maxime.ellerbach@huggingface.co>
2026-04-28 00:57:35 +02:00
Steven Palma 77352c495c chore(dependencies): update uv.lock (#3437)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-04-27 23:15:46 +02:00
Steven Palma 05a5223885 fix(pi): avoid peak RAM in PiGemma construction by freeing replaced submodules (#3454)
Co-Authored-By: Daiki Kamata <daiki.kamata@access-company.com>
Co-Authored-By: Jack Vial <jackvial@users.noreply.github.com>
Co-Authored-By: Ajay Anubolu <AjAnubolu@users.noreply.github.com>
Co-Authored-By: Finn F. <F-Fer@users.noreply.github.com>
2026-04-24 17:50:12 +02:00
Steven Palma 580d818aa9 fix(dataset): no default overwrite in lerobot tool recompute stats (#3452) 2026-04-24 15:07:19 +02:00
Steven Palma 587aa82021 fix(imports): realsense import name is platform dependent (#3451) 2026-04-24 12:55:38 +02:00
Chuyao Shen 12b88fce02 not use dataclass (#3414)
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-04-24 11:26:59 +02:00
masato-ka fc6c94c82a fix(sarm): handle BaseModelOutputWithPooling from transformers 5.x in… (#3419)
* fix(sarm): handle BaseModelOutputWithPooling from transformers 5.x in CLIP encoding

In transformers 5.x, CLIPModel.get_image_features() and get_text_features()
return BaseModelOutputWithPooling instead of a plain torch.FloatTensor.
Added isinstance check to extract pooler_output when the return value is not
a tensor, maintaining backward compatibility with transformers 4.x.

Fixes AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'detach'

* Adding assertion check for pooler_output of CLIP. This change is response to below comment.
https://github.com/huggingface/lerobot/pull/3419#discussion_r3112594387

* Adding assertion check for pooler_output of CLIP. This change is response to below comment. Change to simple check and rise
https://github.com/huggingface/lerobot/pull/3419#discussion_r3126953776

---------
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-04-23 16:26:58 +02:00
Steven Palma 1add460678 fix(policy): loss normalization for padded actions in ACT, Diffusion, and MultiTaskDiT (#3442)
* Fix loss normalization for padded actions in ACT, Diffusion, and MultiTaskDiT

When action_is_pad masks out padded timesteps, the subsequent .mean()
still divides by the total element count (including zeroed-out padding),
underestimating the loss. With 60-70% padding this can cut the effective
gradient signal by 2-3x.

Replace mask-then-mean with mask-then-sum / valid-count for all three
affected policies. TDMPC is not affected because it sums over time
before averaging over batch.

Fixes #3353

* linting

Co-authored-by: whats2000 <60466660+whats2000@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* Update src/lerobot/policies/diffusion/modeling_diffusion.py

Co-authored-by: whats2000 <60466660+whats2000@users.noreply.github.com>
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>

* Update src/lerobot/policies/multi_task_dit/modeling_multi_task_dit.py

Co-authored-by: whats2000 <60466660+whats2000@users.noreply.github.com>
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>

* Update src/lerobot/policies/multi_task_dit/modeling_multi_task_dit.py

Co-authored-by: whats2000 <60466660+whats2000@users.noreply.github.com>
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>

* apply ACT loss normalization suggestion from review

Divide by num_valid (timesteps * action_dim) instead of just timesteps,
matching the diffusion/multi_task_dit fix. Addresses review from
@whats2000 (https://github.com/huggingface/lerobot/pull/3377#discussion_r3106845791).

* fix(test): update safetensor act

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: whats2000 <60466660+whats2000@users.noreply.github.com>
2026-04-23 15:23:54 +02:00
Qi Jia 4587c2b648 fix xvla docs (#3291)
Co-authored-by: Qi Jia <kaufou@gmail.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-04-23 14:50:32 +02:00
whats2000 2236cdb302 fix(smolvla): correct loss normalization for padded actions (#3434)
Apply the same per-scalar-mean fix to SmolVLA that #3377 landed for
ACT / Diffusion / MultiTaskDiT. The pre-patch form applies the
`action_is_pad` mask to zero out padded timesteps, then calls `.mean()`
(or `.mean(dim=(1, 2))`). Because `.mean()` divides by the total number
of elements including the zeroed padding, the loss is diluted by the
padding fraction.

Fixed by normalizing only over valid (non-padded) scalar entries:

    num_valid = ((~actions_is_pad).sum(...) * losses.shape[-1]).clamp_min(1)
    loss = losses.sum(...) / num_valid

`clamp_min(1)` preserves the all-padded-batch edge case (0/1 = 0). Both
reduction paths are updated. Behavior when `action_is_pad` is missing is
unchanged (`losses.mean()`).

Empirical A/B on aloha_sim_transfer_cube_human (chunk_size=40, batch=2,
30 steps, fixed seed, GB200) shows `loss_A / loss_B = 0.9672 (±0.088)` —
same direction and magnitude as PR #3377's `loss_A / loss_C ≈ 0.96` for
ACT. Heavier-padding recipes will see a larger gap.

Refs: #3353 (original report for ACT), #3377 (fix for the other three
policies).
2026-04-23 10:34:11 +02:00
127 changed files with 6020 additions and 1731 deletions
@@ -33,7 +33,7 @@ jobs:
github.event.workflow_run.event == 'pull_request' &&
github.event.workflow_run.conclusion == 'success' &&
github.repository == 'huggingface/lerobot'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@9ad2de8582b56c017cb530c1165116d40433f1c6 # main
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
with:
package_name: lerobot
secrets:
+2 -2
View File
@@ -55,7 +55,7 @@ jobs:
github.repository == 'huggingface/lerobot'
permissions:
contents: read
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
with:
commit_sha: ${{ github.sha }}
package: lerobot
@@ -78,7 +78,7 @@ jobs:
permissions:
contents: read
pull-requests: write
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
+1
View File
@@ -1,3 +1,4 @@
include src/lerobot/templates/lerobot_modelcard_template.md
include src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
include src/lerobot/datasets/card_template.md
include src/lerobot/envs/metaworld_config.json
+8 -4
View File
@@ -39,6 +39,7 @@ from tqdm import tqdm
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.video_utils import (
VideoEncoderConfig,
decode_video_frames,
encode_video_frames,
)
@@ -251,10 +252,13 @@ def benchmark_encoding_decoding(
imgs_dir=imgs_dir,
video_path=video_path,
fps=fps,
vcodec=encoding_cfg["vcodec"],
pix_fmt=encoding_cfg["pix_fmt"],
g=encoding_cfg.get("g"),
crf=encoding_cfg.get("crf"),
camera_encoder_config=VideoEncoderConfig(
vcodec=encoding_cfg["vcodec"],
pix_fmt=encoding_cfg["pix_fmt"],
g=encoding_cfg.get("g"),
crf=encoding_cfg.get("crf"),
preset=encoding_cfg.get("preset"),
),
# fast_decode=encoding_cfg.get("fastdecode"),
overwrite=True,
)
+1 -1
View File
@@ -90,6 +90,6 @@ lerobot-record \
--dataset.single_task="Your task description" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--policy.path=${HF_USER}/act_policy
```
+1 -1
View File
@@ -194,7 +194,7 @@ lerobot-record \
--dataset.single_task="Navigate around obstacles" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--display_data=true
```
+1 -1
View File
@@ -123,7 +123,7 @@ lerobot-record \
--dataset.single_task="Grab and handover the red cube to the other arm" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--policy.path=<user>/groot-bimanual \ # your trained model
--dataset.episode_time_s=30 \
--dataset.reset_time_s=10
+2 -2
View File
@@ -232,7 +232,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--display_data=true
```
@@ -278,6 +278,6 @@ lerobot-record \
--dataset.num_episodes=10 \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--policy.path=outputs/train/hopejr_hand/checkpoints/last/pretrained_model
```
+1 -1
View File
@@ -193,7 +193,7 @@ lerobot-record \
--dataset.num_episodes=5 \
--dataset.single_task="Grab the black cube" \
--dataset.streaming_encoding=true \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--dataset.encoder_threads=2
```
</hfoption>
+1 -1
View File
@@ -43,7 +43,7 @@ lerobot-record \
--dataset.num_episodes=5 \
--dataset.single_task="Grab the black cube" \
--dataset.streaming_encoding=true \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--dataset.encoder_threads=2
```
+2 -2
View File
@@ -161,7 +161,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--display_data=true
```
@@ -203,7 +203,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
--display_data=true
```
+29 -28
View File
@@ -46,7 +46,7 @@ This ensures identical task states map to consistent progress values, even acros
## Inputs and Targets (What the new code expects)
SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:
SARM is trained through its processor (`src/lerobot/rewards/sarm/processor_sarm.py`), which:
- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
@@ -347,7 +347,7 @@ Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predict
<hfoption id="single_stage">
```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
python -m lerobot.rewards.sarm.compute_rabc_weights \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -360,7 +360,7 @@ python src/lerobot/policies/sarm/compute_rabc_weights.py \
<hfoption id="dense_only">
```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
python -m lerobot.rewards.sarm.compute_rabc_weights \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -373,7 +373,7 @@ python src/lerobot/policies/sarm/compute_rabc_weights.py \
<hfoption id="dual">
```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
python -m lerobot.rewards.sarm.compute_rabc_weights \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -429,7 +429,7 @@ The weighting follows **Equations 8-9** from the paper:
First, run the SARM model on all frames in your dataset to compute progress values:
```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
python -m lerobot.rewards.sarm.compute_rabc_weights \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--head-mode sparse \
@@ -465,15 +465,15 @@ This script:
### Step 5b: Train Policy with RA-BC
Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`) if not explicitly provided. Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
```bash
lerobot-train \
--dataset.repo_id=your-username/your-dataset \
--policy.type=pi0 \
--use_rabc=true \
--rabc_head_mode=sparse \
--rabc_kappa=0.01 \
--sample_weighting.type=rabc \
--sample_weighting.head_mode=sparse \
--sample_weighting.kappa=0.01 \
--output_dir=outputs/train/policy_rabc \
--batch_size=32 \
--steps=40000
@@ -488,12 +488,13 @@ The training script automatically:
**RA-BC Arguments:**
| Argument | Description | Default |
| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
| `--use_rabc` | Enable RA-BC sample weighting | `false` |
| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
| `--rabc_head_mode` | Which SARM head's progress to use: `sparse` or `dense` | `sparse` |
| `--rabc_kappa` | Threshold κ for high-quality samples | `0.01` |
| Argument | Description | Default |
| ---------------------------------- | ------------------------------------------------------ | ----------------------- |
| `--sample_weighting.type` | Weighting strategy type (`rabc` or `uniform`) | `rabc` |
| `--sample_weighting.progress_path` | Path to progress parquet file | `sarm_progress.parquet` |
| `--sample_weighting.head_mode` | Which SARM head's progress to use: `sparse` or `dense` | `sparse` |
| `--sample_weighting.kappa` | Threshold κ for high-quality samples | `0.01` |
| `--sample_weighting.epsilon` | Small constant for numerical stability | `1e-6` |
### Tuning RA-BC Kappa
@@ -511,30 +512,30 @@ The `kappa` parameter is the threshold that determines which samples get full we
Monitor these WandB metrics during training:
| Metric | Healthy Range | Problem Indicator |
| ------------------ | ------------- | ------------------------- |
| `rabc_mean_weight` | 0.3 - 0.8 | ≈ 1.0 means kappa too low |
| `rabc_delta_mean` | > 0 | Should be positive |
| `rabc_delta_std` | > 0 | Variance in data quality |
| Metric | Healthy Range | Problem Indicator |
| ----------------------------- | ------------- | ------------------------- |
| `sample_weight_mean_weight` | 0.3 - 0.8 | ≈ 1.0 means kappa too low |
| `sample_weighting/delta_mean` | > 0 | Should be positive |
| `sample_weighting/delta_std` | > 0 | Variance in data quality |
**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
**If `sample_weight_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
**Setting kappa based on your data:**
The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:
The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `sample_weighting/delta_mean` and `sample_weighting/delta_std`:
```
# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
# Most deltas fall in range [0.01, 0.05]
# Option 1: Set kappa = delta_mean (medium selectivity)
--rabc_kappa=0.03
--sample_weighting.kappa=0.03
# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
--rabc_kappa=0.05
--sample_weighting.kappa=0.05
# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
--rabc_kappa=0.07
--sample_weighting.kappa=0.07
```
**When RA-BC may not help:**
@@ -550,8 +551,8 @@ accelerate launch \
src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your-username/your-dataset \
--policy.type=pi0 \
--use_rabc=true \
--rabc_kappa=0.01 \
--sample_weighting.type=rabc \
--sample_weighting.kappa=0.01 \
--output_dir=outputs/train/policy_rabc \
--batch_size=32 \
--steps=40000
@@ -576,7 +577,7 @@ accelerate launch \
### RA-BC
1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
2. **Monitor `sample_weight_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
---
+1 -1
View File
@@ -108,7 +108,7 @@ lerobot-record \
--dataset.num_episodes=10 \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# --dataset.camera_encoder_config.vcodec=auto \
# <- Teleop optional if you want to teleoperate in between episodes \
# --teleop.type=so100_leader \
# --teleop.port=/dev/ttyACM0 \
+37 -27
View File
@@ -14,12 +14,22 @@ This makes `save_episode()` near-instant (the video is already encoded by the ti
## 2. Tuning Parameters
| Parameter | CLI Flag | Type | Default | Description |
| ----------------------- | --------------------------------- | ------------- | ------------- | ----------------------------------------------------------------- |
| `streaming_encoding` | `--dataset.streaming_encoding` | `bool` | `True` | Enable real-time encoding during capture |
| `vcodec` | `--dataset.vcodec` | `str` | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder |
| `encoder_threads` | `--dataset.encoder_threads` | `int \| None` | `None` (auto) | Threads per encoder instance. `None` will leave the vcoded decide |
| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int` | `60` | Max buffered frames per camera (~2s at 30fps). Consumes RAM |
All encoding parameters are grouped under `camera_encoder_config` (a `VideoEncoderConfig` dataclass), accessible from the CLI via `--dataset.camera_encoder_config.<field>`.
| Parameter | CLI Flag | Type | Default | Description |
| ----------------------- | --------------------------------------------- | ------------- | ------------- | ------------------------------------------------------------------- |
| `streaming_encoding` | `--dataset.streaming_encoding` | `bool` | `True` | Enable real-time encoding during capture |
| `vcodec` | `--dataset.camera_encoder_config.vcodec` | `str` | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder |
| `pix_fmt` | `--dataset.camera_encoder_config.pix_fmt` | `str` | `"yuv420p"` | Pixel format |
| `g` | `--dataset.camera_encoder_config.g` | `int \| None` | `2` | GOP size (keyframe interval) |
| `crf` | `--dataset.camera_encoder_config.crf` | `int \| None` | `30` | Quality level (mapped to codec-specific parameter) |
| `preset` | `--dataset.camera_encoder_config.preset` | `int \| None` | `12` | Speed preset (libsvtav1 only, 0 = slowest … 13 = fastest) |
| `fast_decode` | `--dataset.camera_encoder_config.fast_decode` | `int` | `0` | Fast-decode tuning level |
| `encoder_threads` | `--dataset.encoder_threads` | `int \| None` | `None` (auto) | Threads per encoder instance (global). `None` lets the codec decide |
| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int` | `60` | Max buffered frames per camera (~2s at 30fps). Consumes RAM |
> [!TIP]
> Not all parameters apply to every codec. `VideoEncoderConfig` will warn at startup if you set a parameter that your chosen codec ignores (e.g. `preset` with `h264_nvenc`).
## 3. Performance Considerations
@@ -40,7 +50,7 @@ Streaming encoding means the CPU is encoding video **during** the capture loop,
### `encoder_threads` Tuning
This parameter controls how many threads each encoder instance uses internally:
This parameter (`--dataset.encoder_threads`) controls how many threads each encoder instance uses internally:
- **Higher values** (e.g., 4-5): Faster encoding, but uses more CPU cores per camera. Good for high-end systems with many cores.
- **Lower values** (e.g., 1-2): Less CPU per camera, freeing cores for capture and visualization. Good for low-res images and capable CPUs.
@@ -82,15 +92,15 @@ Use HW encoding when:
### Available HW Encoders
| Encoder | Platform | Hardware | CLI Value |
| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------ |
| `h264_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.vcodec=h264_videotoolbox` |
| `hevc_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.vcodec=hevc_videotoolbox` |
| `h264_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.vcodec=h264_nvenc` |
| `hevc_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.vcodec=hevc_nvenc` |
| `h264_vaapi` | Linux | Intel/AMD GPU | `--dataset.vcodec=h264_vaapi` |
| `h264_qsv` | Linux/Windows | Intel Quick Sync | `--dataset.vcodec=h264_qsv` |
| `auto` | Any | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.vcodec=auto` |
| Encoder | Platform | Hardware | CLI Value |
| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ---------------------------------------------------------- |
| `h264_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.camera_encoder_config.vcodec=h264_videotoolbox` |
| `hevc_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.camera_encoder_config.vcodec=hevc_videotoolbox` |
| `h264_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.camera_encoder_config.vcodec=h264_nvenc` |
| `hevc_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.camera_encoder_config.vcodec=hevc_nvenc` |
| `h264_vaapi` | Linux | Intel/AMD GPU | `--dataset.camera_encoder_config.vcodec=h264_vaapi` |
| `h264_qsv` | Linux/Windows | Intel Quick Sync | `--dataset.camera_encoder_config.vcodec=h264_qsv` |
| `auto` | Any | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.camera_encoder_config.vcodec=auto` |
> [!NOTE]
> In order to use the HW accelerated encoders you might need to upgrade your GPU drivers.
@@ -100,15 +110,15 @@ Use HW encoding when:
## 5. Troubleshooting
| Symptom | Likely Cause | Fix |
| ------------------------------------------------------------------ | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage) | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.vcodec=auto`) |
| "Encoder queue full" warnings or dropped frames in dataset | Encoder can't keep up (Queue overflow) | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.vcodec=auto`). |
| High RAM usage | Queue filling faster than encoding | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding |
| Large video files | Using HW encoder or H.264 | Expected trade-off. Switch to `libsvtav1` if CPU allows |
| `save_episode()` still slow | `streaming_encoding` is `False` | Set `--dataset.streaming_encoding=true` |
| Encoder thread crash | Codec not available or invalid settings | Check `vcodec` is installed, try `--dataset.vcodec=auto` |
| Recorded dataset is missing frames | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected. |
| Symptom | Likely Cause | Fix |
| ------------------------------------------------------------------ | -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage) | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.camera_encoder_config.vcodec=auto`) |
| "Encoder queue full" warnings or dropped frames in dataset | Encoder can't keep up (Queue overflow) | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.camera_encoder_config.vcodec=auto`). |
| High RAM usage | Queue filling faster than encoding | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding |
| Large video files | Using HW encoder or H.264 | Expected trade-off. Switch to `libsvtav1` if CPU allows |
| `save_episode()` still slow | `streaming_encoding` is `False` | Set `--dataset.streaming_encoding=true` |
| Encoder thread crash | Codec not available or invalid settings | Check `vcodec` is installed, try `--dataset.camera_encoder_config.vcodec=auto` |
| Recorded dataset is missing frames | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected. |
## 6. Recommended Configurations
@@ -146,10 +156,10 @@ On very constrained systems, streaming encoding may compete too heavily with the
# 2camsx 640x480x3 @30fps: Requires some tuning.
# Use H.264, disable streaming, consider batching encoding
lerobot-record --dataset.vcodec=h264 --dataset.streaming_encoding=false ...
lerobot-record --dataset.camera_encoder_config.vcodec=h264 --dataset.streaming_encoding=false ...
```
## 7. Closing note
Performance ultimately depends on your exact setup — frames-per-second, resolution, CPU cores and load, available memory, episode length, and the encoder you choose. Always test with your target workload, be mindful about your CPU & system capabilities and tune `encoder_threads`, `encoder_queue_maxsize`, and
`vcodec` reasonably. That said, a common practical configuration (for many applications) is three cameras at 640×480x3 @30fps; this usually runs fine with the default streaming video encoding settings in modern systems. Always verify your recorded dataset is healthy by comparing the video duration to the CLI episode duration and confirming the row count equals FPS × CLI duration.
`camera_encoder_config.vcodec` reasonably. That said, a common practical configuration (for many applications) is three cameras at 640×480x3 @30fps; this usually runs fine with the default streaming video encoding settings in modern systems. Always verify your recorded dataset is healthy by comparing the video duration to the CLI episode duration and confirming the row count equals FPS × CLI duration.
+12 -9
View File
@@ -117,10 +117,10 @@ lerobot-edit-dataset \
--repo_id lerobot/pusht_image \
--operation.type convert_image_to_video \
--operation.output_dir outputs/pusht_video \
--operation.vcodec libsvtav1 \
--operation.pix_fmt yuv420p \
--operation.g 2 \
--operation.crf 30
--operation.camera_encoder_config.vcodec libsvtav1 \
--operation.camera_encoder_config.pix_fmt yuv420p \
--operation.camera_encoder_config.g 2 \
--operation.camera_encoder_config.crf 30
# Convert only specific episodes
lerobot-edit-dataset \
@@ -147,11 +147,14 @@ lerobot-edit-dataset \
**Parameters:**
- `output_dir`: Custom output directory (optional - by default uses `new_repo_id` or `{repo_id}_video`)
- `vcodec`: Video codec to use - options: `h264`, `hevc`, `libsvtav1` (default: `libsvtav1`)
- `pix_fmt`: Pixel format - options: `yuv420p`, `yuv444p` (default: `yuv420p`)
- `g`: Group of pictures (GOP) size - lower values give better quality but larger files (default: 2)
- `crf`: Constant rate factor - lower values give better quality but larger files, 0 is lossless (default: 30)
- `fast_decode`: Fast decode tuning option (default: 0)
- `camera_encoder_config`: Video encoder settings — all sub-fields accessible via `--operation.camera_encoder_config.<field>`:
- `vcodec`: Video codec — `h264`, `hevc`, `libsvtav1`, `auto`, or hardware codecs (default: `libsvtav1`)
- `pix_fmt`: Pixel format — `yuv420p`, `yuv444p` (default: `yuv420p`)
- `g`: GOP size — lower values give better quality but larger files (default: 2)
- `crf`: Quality level — lower is better, 0 is lossless (default: 30)
- `preset`: Speed preset, libsvtav1 only (default: 12)
- `fast_decode`: Fast-decode tuning (default: 0)
- `encoder_threads`: Threads per encoder instance — global setting, separate from `camera_encoder_config` (default: None)
- `episode_indices`: List of specific episodes to convert (default: all episodes)
- `num_workers`: Number of parallel workers for processing (default: 4)
+4 -4
View File
@@ -220,7 +220,7 @@ REAL_DIM = 12
# Postprocessing: Trim 20D predictions to 12D for deployment
```
See the [action_hub.py](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/action_hub.py) implementation for details.
See the [action_hub.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/action_hub.py) implementation for details.
#### Auto Action Mode (Recommended)
@@ -519,9 +519,9 @@ If you use X-VLA in your research, please cite:
- [X-VLA Paper](https://arxiv.org/pdf/2510.10274)
- [LeRobot Documentation](https://github.com/huggingface/lerobot)
- [Action Registry Implementation](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/action_hub.py)
- [Processor Implementation](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/processor_xvla.py)
- [Model Configuration](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/configuration_xvla.py)
- [Action Registry Implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/action_hub.py)
- [Processor Implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/processor_xvla.py)
- [Model Configuration](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/configuration_xvla.py)
## Contributing
+1 -1
View File
@@ -69,7 +69,7 @@ class ComputeProgressShards(PipelineStep):
import torch
from tqdm import tqdm
from lerobot.policies.sarm.compute_rabc_weights import (
from lerobot.rewards.sarm.compute_rabc_weights import (
generate_all_frame_indices,
interpolate_progress,
load_sarm_resources,
+1 -1
View File
@@ -10,7 +10,7 @@ from lerobot.datasets import LeRobotDataset
from lerobot.envs.configs import HILSerlProcessorConfig, HILSerlRobotEnvConfig
from lerobot.policies import SACConfig
from lerobot.policies.sac.modeling_sac import SACPolicy
from lerobot.policies.sac.reward_model.modeling_classifier import Classifier
from lerobot.rewards.classifier.modeling_classifier import Classifier
from lerobot.rl.buffer import ReplayBuffer
from lerobot.rl.gym_manipulator import make_robot_env
from lerobot.robots.so_follower import SO100FollowerConfig
@@ -1,7 +1,7 @@
import torch
from lerobot.datasets import LeRobotDataset
from lerobot.policies import RewardClassifierConfig, make_policy, make_pre_post_processors
from lerobot.rewards import RewardClassifierConfig, make_reward_model, make_reward_pre_post_processors
def main():
@@ -22,10 +22,10 @@ def main():
model_name="microsoft/resnet-18",
)
# Make policy, preprocessor, and optimizer
policy = make_policy(config, ds_meta=dataset.meta)
optimizer = config.get_optimizer_preset().build(policy.parameters())
preprocessor, _ = make_pre_post_processors(policy_cfg=config, dataset_stats=dataset.meta.stats)
# Make reward model, preprocessor, and optimizer
reward_model = make_reward_model(config, dataset_stats=dataset.meta.stats)
optimizer = config.get_optimizer_preset().build(reward_model.parameters())
preprocessor, _ = make_reward_pre_post_processors(config, dataset_stats=dataset.meta.stats)
classifier_id = "<user>/reward_classifier_hil_serl_example"
@@ -42,7 +42,7 @@ def main():
batch = preprocessor(batch)
# Forward pass
loss, output_dict = policy.forward(batch)
loss, output_dict = reward_model.forward(batch)
# Backward pass and optimization
optimizer.zero_grad()
@@ -58,8 +58,8 @@ def main():
print("Training finished!")
# You can now save the trained policy.
policy.push_to_hub(classifier_id)
# You can now save the trained reward model.
reward_model.push_to_hub(classifier_id)
if __name__ == "__main__":
@@ -17,6 +17,7 @@ Provides the RealSenseCamera class for capturing frames from Intel RealSense cam
"""
import logging
import sys
import time
from threading import Event, Lock, Thread
from typing import TYPE_CHECKING, Any
@@ -41,6 +42,7 @@ from ..utils import get_cv2_rotation
from .configuration_realsense import RealSenseCameraConfig
logger = logging.getLogger(__name__)
pkg_name = "pyrealsense2-macosx" if sys.platform == "darwin" else "pyrealsense2"
class RealSenseCamera(Camera):
@@ -114,7 +116,7 @@ class RealSenseCamera(Camera):
Args:
config: The configuration settings for the camera.
"""
require_package("pyrealsense2", extra="intelrealsense")
require_package(pkg_name, extra="intelrealsense", import_name="pyrealsense2")
super().__init__(config)
self.config = config
@@ -131,6 +133,9 @@ class RealSenseCamera(Camera):
self.rs_pipeline: rs.pipeline | None = None
self.rs_profile: rs.pipeline_profile | None = None
# Meters per uint16 unit on the depth stream. Queried from the device
# at connect() time. Typical D-series value is 0.001 (= 1 mm/unit).
self.depth_scale: float | None = None
self.thread: Thread | None = None
self.stop_event: Event | None = None
@@ -188,6 +193,17 @@ class RealSenseCamera(Camera):
) from e
self._configure_capture_settings()
# Query depth scale (meters per uint16 unit) when depth is enabled so
# consumers can convert the raw z16 stream to metric distances.
if self.use_depth and self.rs_profile is not None:
try:
depth_sensor = self.rs_profile.get_device().first_depth_sensor()
self.depth_scale = float(depth_sensor.get_depth_scale())
except RuntimeError as e:
logger.warning(f"{self}: failed to query depth scale ({e}); falling back to 0.001 m/unit.")
self.depth_scale = 0.001
self._start_read_thread()
# NOTE(Steven/Caroline): Enforcing at least one second of warmup as RS cameras need a bit of time before the first read. If we don't wait, the first read from the warmup will raise.
@@ -530,7 +546,6 @@ class RealSenseCamera(Camera):
self.latest_timestamp = None
self.new_frame_event.clear()
# NOTE(Steven): Missing implementation for depth for now
@check_if_not_connected
def async_read(self, timeout_ms: float = 200) -> NDArray[Any]:
"""
@@ -573,7 +588,6 @@ class RealSenseCamera(Camera):
return frame
# NOTE(Steven): Missing implementation for depth for now
@check_if_not_connected
def read_latest(self, max_age_ms: int = 500) -> NDArray[Any]:
"""Return the most recent (color) frame captured immediately (Peeking).
@@ -609,6 +623,78 @@ class RealSenseCamera(Camera):
return frame
@check_if_not_connected
def async_read_depth(self, timeout_ms: float = 200) -> NDArray[Any]:
"""Read the latest depth frame asynchronously, in metric meters.
Mirrors :meth:`async_read` but returns the depth stream rather than the
color stream. Output is ``np.uint16`` of shape ``(H, W)``.
Raises:
DeviceNotConnectedError: If the camera is not connected.
RuntimeError: If ``use_depth`` is ``False`` for this camera, or if
the background read thread is not running.
TimeoutError: If no frame becomes available within ``timeout_ms``.
"""
if not self.use_depth:
raise RuntimeError(
f"{self}: cannot read depth — camera was configured with use_depth=False."
)
if self.thread is None or not self.thread.is_alive():
raise RuntimeError(f"{self} read thread is not running.")
if not self.new_frame_event.wait(timeout=timeout_ms / 1000.0):
raise TimeoutError(
f"Timed out waiting for depth frame from camera {self} after {timeout_ms} ms."
)
with self.frame_lock:
depth_frame = self.latest_depth_frame
self.new_frame_event.clear()
if depth_frame is None:
raise RuntimeError(f"Internal error: Event set but no depth frame available for {self}.")
return depth_frame
@check_if_not_connected
def read_latest_depth(self, max_age_ms: int = 500) -> NDArray[Any]:
"""Return the most recent depth frame in metric meters (peeking).
Non-blocking counterpart of :meth:`read_latest` for the depth stream.
Output is ``np.float32`` of shape ``(H, W)`` in meters.
Raises:
DeviceNotConnectedError: If the camera is not connected.
RuntimeError: If ``use_depth`` is ``False`` for this camera, or if
no depth frame has been captured yet.
TimeoutError: If the latest depth frame is older than ``max_age_ms``.
"""
if not self.use_depth:
raise RuntimeError(
f"{self}: cannot read depth — camera was configured with use_depth=False."
)
if self.thread is None or not self.thread.is_alive():
raise RuntimeError(f"{self} read thread is not running.")
with self.frame_lock:
depth_frame = self.latest_depth_frame
timestamp = self.latest_timestamp
if depth_frame is None or timestamp is None:
raise RuntimeError(f"{self} has not captured any depth frames yet.")
age_ms = (time.perf_counter() - timestamp) * 1e3
if age_ms > max_age_ms:
raise TimeoutError(
f"{self} latest depth frame is too old: {age_ms:.1f} ms (max allowed: {max_age_ms} ms)."
)
return depth_frame
def disconnect(self) -> None:
"""
Disconnects from the camera, stops the pipeline, and cleans up resources.
@@ -632,6 +718,8 @@ class RealSenseCamera(Camera):
self.rs_pipeline = None
self.rs_profile = None
self.depth_scale = None
with self.frame_lock:
self.latest_color_frame = None
self.latest_depth_frame = None
+5 -1
View File
@@ -41,8 +41,12 @@ def cfg_to_group(
return tag
return tag[:max_tag_length]
if cfg.is_reward_model_training:
trainable_tag = f"reward_model:{cfg.reward_model.type}"
else:
trainable_tag = f"policy:{cfg.policy.type}"
lst = [
f"policy:{cfg.policy.type}",
trainable_tag,
f"seed:{cfg.seed}",
]
if cfg.dataset is not None:
+2 -2
View File
@@ -17,7 +17,7 @@
from dataclasses import dataclass, field
from lerobot.transforms import ImageTransformsConfig
from lerobot.utils.import_utils import get_safe_default_codec
from lerobot.utils.import_utils import get_safe_default_video_backend
@dataclass
@@ -34,7 +34,7 @@ class DatasetConfig:
image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
revision: str | None = None
use_imagenet_stats: bool = True
video_backend: str = field(default_factory=get_safe_default_codec)
video_backend: str = field(default_factory=get_safe_default_video_backend)
# When True, video frames are returned as uint8 tensors (0-255) instead of float32 (0.0-1.0).
# This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
return_uint8: bool = False
+163
View File
@@ -0,0 +1,163 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import builtins
import json
import logging
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, TypeVar
import draccus
from huggingface_hub import hf_hub_download
from huggingface_hub.constants import CONFIG_NAME
from huggingface_hub.errors import HfHubHTTPError
from lerobot.configs.types import PolicyFeature
from lerobot.optim.optimizers import OptimizerConfig
from lerobot.optim.schedulers import LRSchedulerConfig
from lerobot.utils.device_utils import auto_select_torch_device, is_torch_device_available
from lerobot.utils.hub import HubMixin
T = TypeVar("T", bound="RewardModelConfig")
logger = logging.getLogger(__name__)
@dataclass
class RewardModelConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):
"""Base configuration for reward models.
Args:
input_features: A dictionary defining the PolicyFeature of the input data for the reward. The key represents
the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
output_features: A dictionary defining the PolicyFeature of the output data for the reward. The key represents
the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
"""
# Reuses PolicyFeature
input_features: dict[str, PolicyFeature] = field(default_factory=dict)
output_features: dict[str, PolicyFeature] = field(default_factory=dict)
device: str | None = None
pretrained_path: str | None = None
push_to_hub: bool = False
repo_id: str | None = None
# Hub metadata
license: str | None = None
tags: list[str] | None = None
private: bool | None = None
def __post_init__(self) -> None:
if not self.device or not is_torch_device_available(self.device):
auto_device = auto_select_torch_device()
logger.warning(f"Device '{self.device}' is not available. Switching to '{auto_device}'.")
self.device = auto_device.type
@property
def type(self) -> str:
choice_name = self.get_choice_name(self.__class__)
if not isinstance(choice_name, str):
raise TypeError(f"Expected string from get_choice_name, got {type(choice_name)}")
return choice_name
@property
def observation_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@property
def action_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@property
def reward_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@abc.abstractmethod
def get_optimizer_preset(self) -> OptimizerConfig:
raise NotImplementedError
def get_scheduler_preset(self) -> LRSchedulerConfig | None:
return None
def validate_features(self) -> None:
pass
def _save_pretrained(self, save_directory: Path) -> None:
with open(save_directory / CONFIG_NAME, "w") as f, draccus.config_type("json"):
draccus.dump(self, f, indent=4)
@classmethod
def from_pretrained(
cls: builtins.type[T],
pretrained_name_or_path: str | Path,
*,
force_download: bool = False,
resume_download: bool | None = None,
proxies: dict[Any, Any] | None = None,
token: str | bool | None = None,
cache_dir: str | Path | None = None,
local_files_only: bool = False,
revision: str | None = None,
**reward_kwargs: Any,
) -> T:
model_id = str(pretrained_name_or_path)
config_file: str | None = None
if Path(model_id).is_dir():
if CONFIG_NAME in os.listdir(model_id):
config_file = os.path.join(model_id, CONFIG_NAME)
else:
logger.error(f"{CONFIG_NAME} not found in {Path(model_id).resolve()}")
else:
try:
config_file = hf_hub_download(
repo_id=model_id,
filename=CONFIG_NAME,
revision=revision,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
token=token,
local_files_only=local_files_only,
)
except HfHubHTTPError as e:
raise FileNotFoundError(
f"{CONFIG_NAME} not found on the HuggingFace Hub in {model_id}"
) from e
if config_file is None:
raise FileNotFoundError(f"{CONFIG_NAME} not found in {model_id}")
# HACK: Parse the original config to get the config subclass, so that we can
# apply cli overrides.
with draccus.config_type("json"):
orig_config = draccus.parse(cls, config_file, args=[])
with open(config_file) as f:
config = json.load(f)
config.pop("type", None)
with tempfile.NamedTemporaryFile("w+", delete=False, suffix=".json") as f:
json.dump(config, f)
config_file = f.name
cli_overrides = reward_kwargs.pop("cli_overrides", [])
with draccus.config_type("json"):
return draccus.parse(orig_config.__class__, config_file, args=cli_overrides)
+87 -29
View File
@@ -13,7 +13,9 @@
# limitations under the License.
import builtins
import datetime as dt
import json
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@@ -26,18 +28,57 @@ from lerobot import envs
from lerobot.configs import parser
from lerobot.optim import LRSchedulerConfig, OptimizerConfig
from lerobot.utils.hub import HubMixin
from lerobot.utils.sample_weighting import SampleWeightingConfig
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .rewards import RewardModelConfig
TRAIN_CONFIG_NAME = "train_config.json"
def _migrate_legacy_rabc_fields(config: dict[str, Any]) -> dict[str, Any] | None:
"""Return migrated payload for legacy RA-BC fields, or None when no migration is needed."""
legacy_fields = (
"use_rabc",
"rabc_progress_path",
"rabc_kappa",
"rabc_epsilon",
"rabc_head_mode",
)
if not any(key in config for key in legacy_fields):
return None
migrated_config = dict(config)
use_rabc = bool(migrated_config.pop("use_rabc", False))
rabc_progress_path = migrated_config.pop("rabc_progress_path", None)
rabc_kappa = migrated_config.pop("rabc_kappa", None)
rabc_epsilon = migrated_config.pop("rabc_epsilon", None)
rabc_head_mode = migrated_config.pop("rabc_head_mode", None)
# New configs may already define sample_weighting explicitly. In that case,
# legacy fields are ignored after being stripped from the payload.
if migrated_config.get("sample_weighting") is None and use_rabc:
sample_weighting: dict[str, Any] = {"type": "rabc"}
if rabc_progress_path is not None:
sample_weighting["progress_path"] = rabc_progress_path
if rabc_kappa is not None:
sample_weighting["kappa"] = rabc_kappa
if rabc_epsilon is not None:
sample_weighting["epsilon"] = rabc_epsilon
if rabc_head_mode is not None:
sample_weighting["head_mode"] = rabc_head_mode
migrated_config["sample_weighting"] = sample_weighting
return migrated_config
@dataclass
class TrainPipelineConfig(HubMixin):
dataset: DatasetConfig
env: envs.EnvConfig | None = None
policy: PreTrainedConfig | None = None
reward_model: RewardModelConfig | None = None
# Set `dir` to where you would like to save all of the run outputs. If you run another training session
# with the same value for `dir` its contents will be overwritten unless you set `resume` to true.
output_dir: Path | None = None
@@ -72,27 +113,41 @@ class TrainPipelineConfig(HubMixin):
wandb: WandBConfig = field(default_factory=WandBConfig)
peft: PeftConfig | None = None
# RA-BC (Reward-Aligned Behavior Cloning) parameters
use_rabc: bool = False # Enable reward-weighted training
rabc_progress_path: str | None = None # Path to precomputed SARM progress parquet file
rabc_kappa: float = 0.01 # Hard threshold for high-quality samples
rabc_epsilon: float = 1e-6 # Small constant for numerical stability
rabc_head_mode: str | None = "sparse" # For dual-head models: "sparse" or "dense"
# Sample weighting configuration (e.g., for RA-BC training)
sample_weighting: SampleWeightingConfig | None = None
# Rename map for the observation to override the image and state keys
rename_map: dict[str, str] = field(default_factory=dict)
checkpoint_path: Path | None = field(init=False, default=None)
@property
def is_reward_model_training(self) -> bool:
"""True when the config targets a reward model rather than a policy."""
return self.reward_model is not None
@property
def trainable_config(self) -> PreTrainedConfig | RewardModelConfig:
"""Return whichever config (policy or reward_model) is active."""
if self.is_reward_model_training:
return self.reward_model # type: ignore[return-value]
return self.policy # type: ignore[return-value]
def validate(self) -> None:
# HACK: We parse again the cli args here to get the pretrained paths if there was some.
policy_path = parser.get_path_arg("policy")
if policy_path:
# Only load the policy config
reward_model_path = parser.get_path_arg("reward_model")
if reward_model_path:
cli_overrides = parser.get_cli_overrides("reward_model")
self.reward_model = RewardModelConfig.from_pretrained(
reward_model_path, cli_overrides=cli_overrides
)
self.reward_model.pretrained_path = str(Path(reward_model_path))
elif policy_path:
cli_overrides = parser.get_cli_overrides("policy")
self.policy = PreTrainedConfig.from_pretrained(policy_path, cli_overrides=cli_overrides)
self.policy.pretrained_path = Path(policy_path)
elif self.resume:
# The entire train config is already loaded, we just need to get the checkpoint dir
config_path = parser.parse_arg("config_path")
if not config_path:
raise ValueError(
@@ -108,18 +163,22 @@ class TrainPipelineConfig(HubMixin):
policy_dir = Path(config_path).parent
if self.policy is not None:
self.policy.pretrained_path = policy_dir
if self.reward_model is not None:
self.reward_model.pretrained_path = str(policy_dir)
self.checkpoint_path = policy_dir.parent
if self.policy is None:
if self.policy is None and self.reward_model is None:
raise ValueError(
"Policy is not configured. Please specify a pretrained policy with `--policy.path`."
"Neither policy nor reward_model is configured. "
"Please specify one with `--policy.path` or `--reward_model.path`."
)
active_cfg = self.trainable_config
if not self.job_name:
if self.env is None:
self.job_name = f"{self.policy.type}"
self.job_name = f"{active_cfg.type}"
else:
self.job_name = f"{self.env.type}_{self.policy.type}"
self.job_name = f"{self.env.type}_{active_cfg.type}"
if not self.resume and isinstance(self.output_dir, Path) and self.output_dir.is_dir():
raise FileExistsError(
@@ -137,26 +196,16 @@ class TrainPipelineConfig(HubMixin):
if not self.use_policy_training_preset and (self.optimizer is None or self.scheduler is None):
raise ValueError("Optimizer and Scheduler must be set when the policy presets are not used.")
elif self.use_policy_training_preset and not self.resume:
self.optimizer = self.policy.get_optimizer_preset()
self.scheduler = self.policy.get_scheduler_preset()
self.optimizer = active_cfg.get_optimizer_preset()
self.scheduler = active_cfg.get_scheduler_preset()
if self.policy.push_to_hub and not self.policy.repo_id:
raise ValueError(
"'policy.repo_id' argument missing. Please specify it to push the model to the hub."
)
if self.use_rabc and not self.rabc_progress_path:
# Auto-detect from dataset path
repo_id = self.dataset.repo_id
if self.dataset.root:
self.rabc_progress_path = str(Path(self.dataset.root) / "sarm_progress.parquet")
else:
self.rabc_progress_path = f"hf://datasets/{repo_id}/sarm_progress.parquet"
if hasattr(active_cfg, "push_to_hub") and active_cfg.push_to_hub and not active_cfg.repo_id:
raise ValueError("'repo_id' argument missing. Please specify it to push the model to the hub.")
@classmethod
def __get_path_fields__(cls) -> list[str]:
"""This enables the parser to load config from the policy using `--policy.path=local/dir`"""
return ["policy"]
"""Keys for draccus pretrained-path loading."""
return ["policy", "reward_model"]
def to_dict(self) -> dict[str, Any]:
return draccus.encode(self) # type: ignore[no-any-return] # because of the third-party library draccus uses Any as the return type
@@ -207,6 +256,15 @@ class TrainPipelineConfig(HubMixin):
) from e
cli_args = kwargs.pop("cli_args", [])
if config_file is not None:
with open(config_file) as f:
config = json.load(f)
migrated_config = _migrate_legacy_rabc_fields(config)
if migrated_config is not None:
with tempfile.NamedTemporaryFile("w+", delete=False, suffix=".json") as f:
json.dump(migrated_config, f)
config_file = f.name
with draccus.config_type("json"):
return draccus.parse(cls, config_file, args=cli_args)
+19 -1
View File
@@ -40,10 +40,21 @@ from .io_utils import load_episodes, write_stats
from .lerobot_dataset import LeRobotDataset
from .multi_dataset import MultiLeRobotDataset
from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from .pyav_utils import (
check_video_encoder_config_pyav,
detect_available_encoders_pyav,
get_codec,
)
from .sampler import EpisodeAwareSampler
from .streaming_dataset import StreamingLeRobotDataset
from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
from .video_utils import VideoEncodingManager
from .video_utils import (
DepthEncoderConfig,
VideoEncoderConfig,
VideoEncodingManager,
camera_encoder_defaults,
depth_encoder_defaults,
)
# NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
# and legacy migration constants are intentionally NOT re-exported here.
@@ -58,15 +69,22 @@ __all__ = [
"LeRobotDatasetMetadata",
"MultiLeRobotDataset",
"StreamingLeRobotDataset",
"DepthEncoderConfig",
"VideoEncoderConfig",
"VideoEncodingManager",
"camera_encoder_defaults",
"depth_encoder_defaults",
"add_features",
"aggregate_datasets",
"aggregate_pipeline_dataset_features",
"aggregate_stats",
"check_video_encoder_config_pyav",
"convert_image_to_video_dataset",
"create_initial_features",
"create_lerobot_dataset_card",
"delete_episodes",
"detect_available_encoders_pyav",
"get_codec",
"get_feature_stats",
"load_episodes",
"make_dataset",
+14 -18
View File
@@ -97,8 +97,8 @@ def update_data_df(df, src_meta, dst_meta):
pd.DataFrame: Updated DataFrame with adjusted indices.
"""
df["episode_index"] = df["episode_index"] + dst_meta.info["total_episodes"]
df["index"] = df["index"] + dst_meta.info["total_frames"]
df["episode_index"] = df["episode_index"] + dst_meta.info.total_episodes
df["index"] = df["index"] + dst_meta.info.total_frames
src_task_names = src_meta.tasks.index.take(df["task_index"].to_numpy())
df["task_index"] = dst_meta.tasks.loc[src_task_names, "task_index"].to_numpy()
@@ -225,9 +225,9 @@ def update_meta_data(
# Clean up temporary columns
df = df.drop(columns=["_orig_chunk", "_orig_file"])
df["dataset_from_index"] = df["dataset_from_index"] + dst_meta.info["total_frames"]
df["dataset_to_index"] = df["dataset_to_index"] + dst_meta.info["total_frames"]
df["episode_index"] = df["episode_index"] + dst_meta.info["total_episodes"]
df["dataset_from_index"] = df["dataset_from_index"] + dst_meta.info.total_frames
df["dataset_to_index"] = df["dataset_to_index"] + dst_meta.info.total_frames
df["episode_index"] = df["episode_index"] + dst_meta.info.total_episodes
return df
@@ -237,8 +237,8 @@ def aggregate_datasets(
aggr_repo_id: str,
roots: list[Path] | None = None,
aggr_root: Path | None = None,
data_files_size_in_mb: float | None = None,
video_files_size_in_mb: float | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
chunk_size: int | None = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
@@ -313,8 +313,8 @@ def aggregate_datasets(
# to avoid interference between different source datasets
data_idx.pop("src_to_dst", None)
dst_meta.info["total_episodes"] += src_meta.total_episodes
dst_meta.info["total_frames"] += src_meta.total_frames
dst_meta.info.total_episodes += src_meta.total_episodes
dst_meta.info.total_frames += src_meta.total_frames
finalize_aggregation(dst_meta, all_metadata)
logging.info("Aggregation complete.")
@@ -332,7 +332,6 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
videos_idx: Dictionary tracking video chunk and file indices.
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
Returns:
dict: Updated videos_idx with current chunk and file indices.
"""
@@ -417,6 +416,7 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
concatenate_video_files(
[dst_path, src_path],
dst_path,
compatibility_check=True,
)
# Update duration of this destination file
dst_file_durations[dst_key] = current_dst_duration + src_duration
@@ -640,14 +640,10 @@ def finalize_aggregation(aggr_meta, all_metadata):
write_tasks(aggr_meta.tasks, aggr_meta.root)
logging.info("write info")
aggr_meta.info.update(
{
"total_tasks": len(aggr_meta.tasks),
"total_episodes": sum(m.total_episodes for m in all_metadata),
"total_frames": sum(m.total_frames for m in all_metadata),
"splits": {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"},
}
)
aggr_meta.info.total_tasks = len(aggr_meta.tasks)
aggr_meta.info.total_episodes = sum(m.total_episodes for m in all_metadata)
aggr_meta.info.total_frames = sum(m.total_frames for m in all_metadata)
aggr_meta.info.splits = {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"}
write_info(aggr_meta.info, aggr_meta.root)
logging.info("write stats")
+58 -27
View File
@@ -37,20 +37,18 @@ from .io_utils import (
load_subtasks,
load_tasks,
write_info,
write_json,
write_stats,
write_tasks,
)
from .utils import (
DEFAULT_EPISODES_PATH,
INFO_PATH,
check_version_compatibility,
get_safe_version,
has_legacy_hub_download_metadata,
is_valid_version,
update_chunk_file_indices,
)
from .video_utils import get_video_info
from .video_utils import VideoEncoderConfig, get_video_info
CODEBASE_VERSION = "v3.0"
@@ -228,7 +226,7 @@ class LeRobotDatasetMetadata:
@property
def _version(self) -> packaging.version.Version:
"""Codebase version used to create this dataset."""
return packaging.version.parse(self.info["codebase_version"])
return packaging.version.parse(self.info.codebase_version)
def get_data_file_path(self, ep_index: int) -> Path:
"""Return the relative parquet file path for the given episode index.
@@ -283,27 +281,27 @@ class LeRobotDatasetMetadata:
@property
def data_path(self) -> str:
"""Formattable string for the parquet files."""
return self.info["data_path"]
return self.info.data_path
@property
def video_path(self) -> str | None:
"""Formattable string for the video files."""
return self.info["video_path"]
return self.info.video_path
@property
def robot_type(self) -> str | None:
"""Robot type used in recording this dataset."""
return self.info["robot_type"]
return self.info.robot_type
@property
def fps(self) -> int:
"""Frames per second used during data collection."""
return self.info["fps"]
return self.info.fps
@property
def features(self) -> dict[str, dict]:
"""All features contained in the dataset."""
return self.info["features"]
return self.info.features
@property
def image_keys(self) -> list[str]:
@@ -315,6 +313,20 @@ class LeRobotDatasetMetadata:
"""Keys to access visual modalities stored as videos."""
return [key for key, ft in self.features.items() if ft["dtype"] == "video"]
@property
def depth_keys(self) -> list[str]:
"""Keys to access depth-map modalities stored as videos.
A depth video key is a feature whose ``info`` dict carries
``"video.is_depth_map": True`` (set either at creation time by the user
or after the first encoded episode by :meth:`update_video_info`).
"""
return [
key
for key, ft in self.features.items()
if ft["dtype"] == "video" and ft.get("info", {}).get("video.is_depth_map", False)
]
@property
def camera_keys(self) -> list[str]:
"""Keys to access visual modalities (regardless of their storage method)."""
@@ -333,32 +345,32 @@ class LeRobotDatasetMetadata:
@property
def total_episodes(self) -> int:
"""Total number of episodes available."""
return self.info["total_episodes"]
return self.info.total_episodes
@property
def total_frames(self) -> int:
"""Total number of frames saved in this dataset."""
return self.info["total_frames"]
return self.info.total_frames
@property
def total_tasks(self) -> int:
"""Total number of different tasks performed in this dataset."""
return self.info["total_tasks"]
return self.info.total_tasks
@property
def chunks_size(self) -> int:
"""Max number of files per chunk."""
return self.info["chunks_size"]
return self.info.chunks_size
@property
def data_files_size_in_mb(self) -> int:
"""Max size of data file in mega bytes."""
return self.info["data_files_size_in_mb"]
return self.info.data_files_size_in_mb
@property
def video_files_size_in_mb(self) -> int:
"""Max size of video file in mega bytes."""
return self.info["video_files_size_in_mb"]
return self.info.video_files_size_in_mb
def get_task_index(self, task: str) -> int | None:
"""
@@ -502,29 +514,48 @@ class LeRobotDatasetMetadata:
self._save_episode_metadata(episode_dict)
# Update info
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
self.info["total_tasks"] = len(self.tasks)
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
self.info.total_episodes += 1
self.info.total_frames += episode_length
self.info.total_tasks = len(self.tasks)
self.info.splits = {"train": f"0:{self.info.total_episodes}"}
write_info(self.info, self.root)
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
write_stats(self.stats, self.root)
def update_video_info(self, video_key: str | None = None) -> None:
"""
def update_video_info(
self,
video_key: str | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> None:
"""Populate per-feature video info in ``info.json``.
Warning: this function writes info from first episode videos, implicitly assuming that all videos have
been encoded the same way. Also, this means it assumes the first episode exists.
Args:
video_key: If provided, only update this video key. Otherwise update
all video keys in the dataset.
camera_encoder_config: Encoder configuration used to produce the
videos. When provided, its fields are recorded as
``video.<field>`` entries alongside the stream-derived
``video.*`` entries (see :func:`get_video_info`).
"""
if video_key is not None and video_key not in self.video_keys:
raise ValueError(f"Video key {video_key} not found in dataset")
video_keys = [video_key] if video_key is not None else self.video_keys
for key in video_keys:
if not self.features[key].get("info", None):
existing = self.features[key].get("info") or {}
# Repopulate when codec metadata is missing — preserves user-provided
# markers like ``video.is_depth_map`` while still recording stream
# info on the first episode.
if not existing or "video.codec" not in existing:
video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
self.info["features"][key]["info"] = get_video_info(video_path)
stream_info = get_video_info(video_path, camera_encoder_config=camera_encoder_config)
merged = {**existing, **stream_info}
self.info.features[key]["info"] = merged
def update_chunk_settings(
self,
@@ -546,17 +577,17 @@ class LeRobotDatasetMetadata:
if chunks_size is not None:
if chunks_size <= 0:
raise ValueError(f"chunks_size must be positive, got {chunks_size}")
self.info["chunks_size"] = chunks_size
self.info.chunks_size = chunks_size
if data_files_size_in_mb is not None:
if data_files_size_in_mb <= 0:
raise ValueError(f"data_files_size_in_mb must be positive, got {data_files_size_in_mb}")
self.info["data_files_size_in_mb"] = data_files_size_in_mb
self.info.data_files_size_in_mb = data_files_size_in_mb
if video_files_size_in_mb is not None:
if video_files_size_in_mb <= 0:
raise ValueError(f"video_files_size_in_mb must be positive, got {video_files_size_in_mb}")
self.info["video_files_size_in_mb"] = video_files_size_in_mb
self.info.video_files_size_in_mb = video_files_size_in_mb
# Update the info file on disk
write_info(self.info, self.root)
@@ -653,7 +684,7 @@ class LeRobotDatasetMetadata:
f"Features contain video keys {obj.video_keys}, but 'use_videos' is set to False. "
"Either remove video features from the features dict, or set 'use_videos=True'."
)
write_json(obj.info, obj.root / INFO_PATH)
write_info(obj.info, obj.root)
obj.revision = None
obj._pq_writer = None
obj.latest_episode = None
+28 -8
View File
@@ -32,7 +32,13 @@ from .io_utils import (
hf_transform_to_torch,
load_nested_dataset,
)
from .video_utils import decode_video_frames
from .video_utils import decode_depth_frames, decode_video_frames
from .depth_utils import (
DEFAULT_DEPTH_MIN,
DEFAULT_DEPTH_MAX,
DEFAULT_DEPTH_SHIFT,
DEFAULT_DEPTH_USE_LOG,
)
class DatasetReader:
@@ -237,17 +243,31 @@ class DatasetReader:
"""
ep = self._meta.episodes[ep_idx]
depth_keys = set(self._meta.depth_keys)
def _decode_single(vid_key: str, query_ts: list[float]) -> tuple[str, torch.Tensor]:
from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
video_path = self.root / self._meta.get_video_file_path(ep_idx, vid_key)
frames = decode_video_frames(
video_path,
shifted_query_ts,
self._tolerance_s,
self._video_backend,
return_uint8=self._return_uint8,
)
if vid_key in depth_keys:
feature_info = self._meta.features[vid_key].get("info") or {}
frames = decode_depth_frames(
video_path,
shifted_query_ts,
self._tolerance_s,
depth_min=feature_info.get("video.depth_min", DEFAULT_DEPTH_MIN),
depth_max=feature_info.get("video.depth_max", DEFAULT_DEPTH_MAX),
shift=feature_info.get("video.shift", DEFAULT_DEPTH_SHIFT),
use_log=feature_info.get("video.use_log", DEFAULT_DEPTH_USE_LOG),
)
else:
frames = decode_video_frames(
video_path,
shifted_query_ts,
self._tolerance_s,
self._video_backend,
return_uint8=self._return_uint8,
)
return vid_key, frames.squeeze(0)
items = list(query_timestamps.items())
+55 -73
View File
@@ -62,7 +62,7 @@ from .utils import (
DEFAULT_EPISODES_PATH,
update_chunk_file_indices,
)
from .video_utils import encode_video_frames, get_video_info
from .video_utils import VideoEncoderConfig, encode_video_frames, get_video_info
def _load_episode_with_stats(src_dataset: LeRobotDataset, episode_idx: int) -> dict:
@@ -92,6 +92,7 @@ def delete_episodes(
episode_indices: list[int],
output_dir: str | Path | None = None,
repo_id: str | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> LeRobotDataset:
"""Delete episodes from a LeRobotDataset and create a new dataset.
@@ -100,6 +101,7 @@ def delete_episodes(
episode_indices: List of episode indices to delete.
output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
camera_encoder_config: Video encoder settings used when re-encoding video segments (default: :class:`VideoEncoderConfig()`).
"""
if not episode_indices:
raise ValueError("No episodes to delete")
@@ -132,7 +134,7 @@ def delete_episodes(
video_metadata = None
if dataset.meta.video_keys:
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping, camera_encoder_config)
data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)
@@ -154,6 +156,7 @@ def split_dataset(
dataset: LeRobotDataset,
splits: dict[str, float | list[int]],
output_dir: str | Path | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> dict[str, LeRobotDataset]:
"""Split a LeRobotDataset into multiple smaller datasets.
@@ -162,6 +165,7 @@ def split_dataset(
splits: Either a dict mapping split names to episode indices, or a dict mapping
split names to fractions (must sum to <= 1.0).
output_dir: Root directory where the split datasets will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id.
camera_encoder_config: Video encoder settings used when re-encoding video segments (default: :class:`VideoEncoderConfig()`).
Examples:
Split by specific episodes
@@ -222,7 +226,9 @@ def split_dataset(
video_metadata = None
if dataset.meta.video_keys:
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
video_metadata = _copy_and_reindex_videos(
dataset, new_meta, episode_mapping, camera_encoder_config
)
data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)
@@ -578,8 +584,7 @@ def _keep_episodes_from_video_with_av(
output_path: Path,
episodes_to_keep: list[tuple[int, int]],
fps: float,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
camera_encoder_config: VideoEncoderConfig | None = None,
) -> None:
"""Keep only specified episodes from a video file using PyAV.
@@ -593,9 +598,10 @@ def _keep_episodes_from_video_with_av(
Ranges are half-open intervals: [start_frame, end_frame), where start_frame
is inclusive and end_frame is exclusive.
fps: Frame rate of the video.
vcodec: Video codec to use for encoding.
pix_fmt: Pixel format for output video.
camera_encoder_config: Video encoder settings (default: :class:`VideoEncoderConfig()`).
"""
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
from fractions import Fraction
import av
@@ -619,12 +625,12 @@ def _keep_episodes_from_video_with_av(
# Convert fps to Fraction for PyAV compatibility.
fps_fraction = Fraction(fps).limit_denominator(1000)
v_out = out.add_stream(vcodec, rate=fps_fraction)
v_out = out.add_stream(camera_encoder_config.vcodec, rate=fps_fraction)
# PyAV type stubs don't distinguish video streams from audio/subtitle streams.
v_out.width = v_in.codec_context.width
v_out.height = v_in.codec_context.height
v_out.pix_fmt = pix_fmt
v_out.pix_fmt = camera_encoder_config.pix_fmt
# Set time_base to match the frame rate for proper timestamp handling.
v_out.time_base = Fraction(1, int(fps))
@@ -687,8 +693,7 @@ def _copy_and_reindex_videos(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
episode_mapping: dict[int, int],
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
camera_encoder_config: VideoEncoderConfig | None = None,
) -> dict[int, dict]:
"""Copy and filter video files, only re-encoding files with deleted episodes.
@@ -700,10 +705,13 @@ def _copy_and_reindex_videos(
src_dataset: Source dataset to copy from
dst_meta: Destination metadata object
episode_mapping: Mapping from old episode indices to new indices
camera_encoder_config: Video encoder settings used when re-encoding segments (default: :class:`VideoEncoderConfig()`).
Returns:
dict mapping episode index to its video metadata (chunk_index, file_index, timestamps)
"""
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
if src_dataset.meta.episodes is None:
src_dataset.meta.episodes = load_episodes(src_dataset.meta.root)
@@ -792,8 +800,7 @@ def _copy_and_reindex_videos(
dst_video_path,
episodes_to_keep_ranges,
src_dataset.meta.fps,
vcodec,
pix_fmt,
camera_encoder_config,
)
cumulative_ts = 0.0
@@ -897,14 +904,10 @@ def _copy_and_reindex_episodes_metadata(
dst_meta.finalize()
dst_meta.info.update(
{
"total_episodes": len(episode_mapping),
"total_frames": total_frames,
"total_tasks": len(dst_meta.tasks) if dst_meta.tasks is not None else 0,
"splits": {"train": f"0:{len(episode_mapping)}"},
}
)
dst_meta.info.total_episodes = len(episode_mapping)
dst_meta.info.total_frames = total_frames
dst_meta.info.total_tasks = len(dst_meta.tasks) if dst_meta.tasks is not None else 0
dst_meta.info.splits = {"train": f"0:{len(episode_mapping)}"}
write_info(dst_meta.info, dst_meta.root)
if not all_stats:
@@ -1069,21 +1072,20 @@ def _copy_episodes_metadata_and_stats(
if episodes_dir.exists():
shutil.copytree(episodes_dir, dst_episodes_dir, dirs_exist_ok=True)
dst_meta.info.update(
{
"total_episodes": src_dataset.meta.total_episodes,
"total_frames": src_dataset.meta.total_frames,
"total_tasks": src_dataset.meta.total_tasks,
"splits": src_dataset.meta.info.get("splits", {"train": f"0:{src_dataset.meta.total_episodes}"}),
}
dst_meta.info.total_episodes = src_dataset.meta.total_episodes
dst_meta.info.total_frames = src_dataset.meta.total_frames
dst_meta.info.total_tasks = src_dataset.meta.total_tasks
# Preserve original splits if available, otherwise create default
dst_meta.info.splits = (
src_dataset.meta.info.splits
if src_dataset.meta.info.splits
else {"train": f"0:{src_dataset.meta.total_episodes}"}
)
if dst_meta.video_keys and src_dataset.meta.video_keys:
for key in dst_meta.video_keys:
if key in src_dataset.meta.features:
dst_meta.info["features"][key]["info"] = src_dataset.meta.info["features"][key].get(
"info", {}
)
dst_meta.info.features[key]["info"] = src_dataset.meta.info.features[key].get("info", {})
write_info(dst_meta.info, dst_meta.root)
@@ -1269,11 +1271,7 @@ def _estimate_frame_size_via_calibration(
episode_indices: list[int],
temp_dir: Path,
fps: int,
vcodec: str,
pix_fmt: str,
g: int,
crf: int,
fast_decode: int,
camera_encoder_config: VideoEncoderConfig,
num_calibration_frames: int = 30,
) -> float:
"""Estimate MB per frame by encoding a small calibration sample.
@@ -1287,11 +1285,7 @@ def _estimate_frame_size_via_calibration(
episode_indices: List of episode indices being processed.
temp_dir: Temporary directory for calibration files.
fps: Frames per second for video encoding.
vcodec: Video codec (libsvtav1, h264, hevc).
pix_fmt: Pixel format (yuv420p, etc.).
g: GOP size (group of pictures).
crf: Constant Rate Factor (quality).
fast_decode: Fast decode tuning parameter.
camera_encoder_config: Video encoder settings used for calibration encoding.
num_calibration_frames: Number of frames to use for calibration (default: 30).
Returns:
@@ -1327,11 +1321,7 @@ def _estimate_frame_size_via_calibration(
imgs_dir=calibration_dir,
video_path=calibration_video_path,
fps=fps,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
camera_encoder_config=camera_encoder_config,
overwrite=True,
)
@@ -1525,7 +1515,7 @@ def modify_tasks(
write_tasks(new_task_df, root)
# Update info.json
dataset.meta.info["total_tasks"] = len(unique_tasks)
dataset.meta.info.total_tasks = len(unique_tasks)
write_info(dataset.meta.info, root)
# Reload metadata to reflect changes
@@ -1649,11 +1639,7 @@ def convert_image_to_video_dataset(
dataset: LeRobotDataset,
output_dir: Path | None = None,
repo_id: str | None = None,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int = 2,
crf: int = 30,
fast_decode: int = 0,
camera_encoder_config: VideoEncoderConfig | None = None,
episode_indices: list[int] | None = None,
num_workers: int = 4,
max_episodes_per_batch: int | None = None,
@@ -1668,11 +1654,7 @@ def convert_image_to_video_dataset(
dataset: The source LeRobot dataset with images
output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
vcodec: Video codec (default: libsvtav1)
pix_fmt: Pixel format (default: yuv420p)
g: Group of pictures size (default: 2)
crf: Constant rate factor (default: 30)
fast_decode: Fast decode tuning (default: 0)
camera_encoder_config: Video encoder settings (default: :class:`VideoEncoderConfig()`).
episode_indices: List of episode indices to convert (None = all episodes)
num_workers: Number of threads for parallel processing (default: 4)
max_episodes_per_batch: Maximum episodes per video batch to avoid memory issues (None = no limit)
@@ -1681,6 +1663,9 @@ def convert_image_to_video_dataset(
Returns:
New LeRobotDataset with images encoded as videos
"""
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
# Check that it's an image dataset
if len(dataset.meta.video_keys) > 0:
raise ValueError(
@@ -1704,7 +1689,10 @@ def convert_image_to_video_dataset(
logging.info(
f"Converting {len(episode_indices)} episodes with {len(img_keys)} cameras from {dataset.repo_id}"
)
logging.info(f"Video codec: {vcodec}, pixel format: {pix_fmt}, GOP: {g}, CRF: {crf}")
logging.info(
f"Video codec: {camera_encoder_config.vcodec}, pixel format: {camera_encoder_config.pix_fmt}, "
f"GOP: {camera_encoder_config.g}, CRF: {camera_encoder_config.crf}"
)
# Create new features dict, converting image features to video features
new_features = {}
@@ -1774,11 +1762,7 @@ def convert_image_to_video_dataset(
episode_indices=episode_indices,
temp_dir=temp_dir,
fps=fps,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
camera_encoder_config=camera_encoder_config,
)
logging.info(f"Processing camera: {img_key}")
@@ -1820,11 +1804,7 @@ def convert_image_to_video_dataset(
imgs_dir=imgs_dir,
video_path=video_path,
fps=fps,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
camera_encoder_config=camera_encoder_config,
overwrite=True,
)
@@ -1858,10 +1838,10 @@ def convert_image_to_video_dataset(
episodes_df.to_parquet(episodes_path, index=False)
# Update metadata info
new_meta.info["total_episodes"] = len(episode_indices)
new_meta.info["total_frames"] = sum(ep["length"] for ep in all_episode_metadata.values())
new_meta.info["total_tasks"] = dataset.meta.total_tasks
new_meta.info["splits"] = {"train": f"0:{len(episode_indices)}"}
new_meta.info.total_episodes = len(episode_indices)
new_meta.info.total_frames = sum(ep["length"] for ep in all_episode_metadata.values())
new_meta.info.total_tasks = dataset.meta.total_tasks
new_meta.info.splits = {"train": f"0:{len(episode_indices)}"}
# Update video info for all image keys (now videos)
# We need to manually set video info since update_video_info() checks video_keys first
@@ -1870,7 +1850,9 @@ def convert_image_to_video_dataset(
video_path = new_meta.root / new_meta.video_path.format(
video_key=img_key, chunk_index=0, file_index=0
)
new_meta.info["features"][img_key]["info"] = get_video_info(video_path)
new_meta.info.features[img_key]["info"] = get_video_info(
video_path, camera_encoder_config=camera_encoder_config
)
write_info(new_meta.info, new_meta.root)
+46 -11
View File
@@ -46,15 +46,19 @@ from .io_utils import (
write_info,
)
from .utils import (
DEFAULT_DEPTH_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_IMAGE_PATH,
update_chunk_file_indices,
)
from .video_utils import (
DepthEncoderConfig,
StreamingVideoEncoder,
VideoEncoderConfig,
concatenate_video_files,
encode_video_frames,
get_video_duration_in_s,
is_depth_feature,
)
logger = logging.getLogger(__name__)
@@ -65,14 +69,19 @@ def _encode_video_worker(
episode_index: int,
root: Path,
fps: int,
vcodec: str = "libsvtav1",
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
) -> Path:
temp_path = Path(tempfile.mkdtemp(dir=root)) / f"{video_key}_{episode_index:03d}.mp4"
fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=episode_index, frame_index=0)
img_dir = (root / fpath).parent
encode_video_frames(
img_dir, temp_path, fps, vcodec=vcodec, overwrite=True, encoder_threads=encoder_threads
img_dir,
temp_path,
fps,
camera_encoder_config=camera_encoder_config,
encoder_threads=encoder_threads,
overwrite=True,
)
shutil.rmtree(img_dir)
return temp_path
@@ -89,33 +98,40 @@ class DatasetWriter:
self,
meta: LeRobotDatasetMetadata,
root: Path,
vcodec: str,
camera_encoder_config: VideoEncoderConfig,
encoder_threads: int | None,
batch_encoding_size: int,
streaming_encoder: StreamingVideoEncoder | None = None,
initial_frames: int = 0,
depth_encoder_config: DepthEncoderConfig | None = None,
):
"""Initialize the writer with metadata, codec, and encoding config.
"""Initialize the writer with metadata, codec, and encoder config.
Args:
meta: Dataset metadata instance (used for feature schema, chunk
settings, and episode persistence).
root: Local dataset root directory.
vcodec: Video codec for encoding (e.g. ``'libsvtav1'``, ``'h264'``).
encoder_threads: Threads per encoder instance. ``None`` for auto.
camera_encoder_config: Video encoder settings applied to all cameras.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos.
streaming_encoder: Optional pre-built :class:`StreamingVideoEncoder`
for real-time encoding. ``None`` disables streaming mode.
initial_frames: Starting frame count (non-zero when resuming).
depth_encoder_config: Optional depth-map encoder config used in
place of ``camera_encoder_config`` for keys present in
``meta.depth_keys``.
"""
self._meta = meta
self._root = root
self._vcodec = vcodec
self._camera_encoder_config = camera_encoder_config
self._depth_encoder_config = depth_encoder_config
self._encoder_threads = encoder_threads
self._batch_encoding_size = batch_encoding_size
self._streaming_encoder = streaming_encoder
# Writer state
self.image_writer: AsyncImageWriter | None = None
self.episode_buffer: dict = self._create_episode_buffer()
@@ -135,8 +151,16 @@ class DatasetWriter:
ep_buffer[key] = current_ep_idx if key == "episode_index" else []
return ep_buffer
def _is_depth_image_key(self, image_key: str) -> bool:
"""Whether *image_key* is a depth feature stored as per-frame images."""
ft = self._meta.features.get(image_key)
if ft is None or ft.get("dtype") != "image":
return False
return is_depth_feature(ft.get("info") or {})
def _get_image_file_path(self, episode_index: int, image_key: str, frame_index: int) -> Path:
fpath = DEFAULT_IMAGE_PATH.format(
path_template = DEFAULT_DEPTH_PATH if self._is_depth_image_key(image_key) else DEFAULT_IMAGE_PATH
fpath = path_template.format(
image_key=image_key, episode_index=episode_index, frame_index=frame_index
)
return self._root / fpath
@@ -284,7 +308,7 @@ class DatasetWriter:
episode_index,
self._root,
self._meta.fps,
self._vcodec,
self._camera_encoder_config,
self._encoder_threads,
): video_key
for video_key in self._meta.video_keys
@@ -495,7 +519,13 @@ class DatasetWriter:
# Update video info (only needed when first episode is encoded)
if episode_index == 0:
self._meta.update_video_info(video_key)
is_depth_key = video_key in set(self._meta.depth_keys)
cfg_for_info = (
self._depth_encoder_config
if is_depth_key and self._depth_encoder_config is not None
else self._camera_encoder_config
)
self._meta.update_video_info(video_key, camera_encoder_config=cfg_for_info)
write_info(self._meta.info, self._meta.root)
metadata = {
@@ -564,7 +594,12 @@ class DatasetWriter:
def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> Path:
"""Use ffmpeg to convert frames stored as png into mp4 videos."""
return _encode_video_worker(
video_key, episode_index, self._root, self._meta.fps, self._vcodec, self._encoder_threads
video_key,
episode_index,
self._root,
self._meta.fps,
self._camera_encoder_config,
self._encoder_threads,
)
def close_writer(self) -> None:
+189
View File
@@ -0,0 +1,189 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Depth encoding/decoding helpers for :class:`VideoEncoderConfig`.
"""
import math
from typing import Literal
import numpy as np
import torch
from numpy.typing import NDArray
DEPTH_QUANT_BITS: int = 12
DEPTH_QMAX: int = (1 << DEPTH_QUANT_BITS) - 1 # 4095
_MM_PER_METRE: float = 1000.0
_UINT16_MAX: int = 65535
DEFAULT_DEPTH_MIN: float = 0.01
DEFAULT_DEPTH_MAX: float = 10.0
DEFAULT_DEPTH_SHIFT: float = 3.5
DEFAULT_DEPTH_USE_LOG: bool = True
def _validate_log_quant_params(depth_min: float, shift: float) -> None:
"""Ensure ``log(depth_min + shift)`` is finite."""
if depth_min + shift <= 0:
raise ValueError(
f"depth_min + shift must be positive for logarithmic quantization, "
f"got depth_min={depth_min} + shift={shift} = {depth_min + shift}"
)
def _depth_input_to_float32_and_unit(
depth: NDArray[np.uint16] | NDArray[np.floating] | torch.Tensor,
input_unit: Literal["auto", "m", "mm"],
) -> tuple[NDArray[np.float32], Literal["m", "mm"]]:
"""Depth as float32 in the chosen unit, plus the resolved unit."""
if isinstance(depth, torch.Tensor):
t = depth.detach().cpu()
arr = t.numpy()
is_floating = t.is_floating_point()
else:
arr = np.asarray(depth)
is_floating = np.issubdtype(arr.dtype, np.floating)
resolved_unit: Literal["m", "mm"]
if input_unit == "auto":
resolved_unit = "m" if is_floating else "mm"
else:
resolved_unit = input_unit
# Convert to float32 to keep typing consistency
return np.asarray(arr, dtype=np.float32, order="K"), resolved_unit
def quantize_depth(
depth: NDArray[np.uint16] | NDArray[np.floating] | torch.Tensor,
depth_min: float = DEFAULT_DEPTH_MIN,
depth_max: float = DEFAULT_DEPTH_MAX,
shift: float = DEFAULT_DEPTH_SHIFT,
use_log: bool = DEFAULT_DEPTH_USE_LOG,
*,
input_unit: Literal["auto", "m", "mm"] = "auto",
) -> NDArray[np.uint16]:
"""Quantize depth to 12-bit codes (``uint16``, values ``0…DEPTH_QMAX``).
Depth maps are packed into 12-bit integer frames so they fit in standard
high-bit-depth pixel formats (e.g. ``yuv420p12le`` / ``gray12le``)
and can be encoded by widely supported video codecs (HEVC Main 12, ffv1).
Logarithmic quantization is the default because it allocates more quanta
to near-range depth, which matches the (1/depth) error profile of typical
depth sensors. Math is ported from BEHAVIOR-1K's ``obs_utils.py``.
**Input units**:
- ``input_unit="auto"`` (default): infer from dtype (floating = m, non-floating = mm).
- ``input_unit="mm"``: interpret input values as millimetres.
- ``input_unit="m"``: interpret input values as metres.
Quantization math runs in the **resolved input unit**.
``depth_min``, ``depth_max``, and ``shift`` are always in **metres**.
Args:
depth: Depth map; ``torch.Tensor`` is moved to CPU for conversion.
depth_min: Depth (metres) at quantum ``0``.
depth_max: Depth (metres) at quantum :data:`DEPTH_QMAX`.
shift: Depth shift (metres); used in log mode. Must satisfy ``depth_min + shift > 0``.
use_log: If ``True`` (default), quantize in log space.
input_unit: Input unit policy (``"auto"``, ``"mm"``, ``"m"``).
Returns:
``numpy.ndarray``, ``dtype=uint16``, same shape as ``depth``, values in
``[0, DEPTH_QMAX]``.
Raises:
ValueError: If ``input_unit`` is not ``"auto"``, ``"mm"``, or ``"m"``.
ValueError: If ``use_log=True`` and ``depth_min + shift <= 0``.
"""
if input_unit not in ("auto", "m", "mm"):
raise ValueError(f"input_unit must be 'auto', 'm', or 'mm', got {input_unit!r}")
depth_f, resolved_unit = _depth_input_to_float32_and_unit(depth, input_unit=input_unit)
depth_min_u = np.float32(depth_min) if resolved_unit == "m" else np.float32(depth_min * _MM_PER_METRE)
depth_max_u = np.float32(depth_max) if resolved_unit == "m" else np.float32(depth_max * _MM_PER_METRE)
shift_u = np.float32(shift) if resolved_unit == "m" else np.float32(shift * _MM_PER_METRE)
if use_log:
_validate_log_quant_params(depth_min, shift)
log_min = math.log(float(depth_min_u + shift_u))
log_max = math.log(float(depth_max_u + shift_u))
norm = (np.log(depth_f + shift_u) - log_min) / (log_max - log_min)
else:
norm = (depth_f - depth_min_u) / (depth_max_u - depth_min_u)
out = np.rint(norm * DEPTH_QMAX).clip(0, DEPTH_QMAX)
return out.astype(np.uint16, copy=False)
def dequantize_depth(
quantized: NDArray[np.uint16] | torch.Tensor,
depth_min: float = DEFAULT_DEPTH_MIN,
depth_max: float = DEFAULT_DEPTH_MAX,
shift: float = DEFAULT_DEPTH_SHIFT,
use_log: bool = DEFAULT_DEPTH_USE_LOG,
*,
output_unit: Literal["m", "mm"] = "mm",
) -> NDArray[np.uint16] | NDArray[np.float32]:
"""Inverse of :func:`quantize_depth`.
Tuning arguments **must match** :func:`quantize_depth`.
Decoding inverts the same normalized code mapping as :func:`quantize_depth`
using ``depth_min`` / ``depth_max`` / ``shift`` (in metres), then returns
the requested output unit.
Args:
quantized: 12-bit codes ``[0, DEPTH_QMAX]``, ``dtype=uint16``.
depth_min, depth_max, shift, use_log: Same as :func:`quantize_depth` (metres).
output_unit: ``\"mm\"`` returns ``uint16`` millimetres (``rint``, clip
``[0, 65535]``). ``\"m\"`` returns ``float32`` metres in
``[depth_min, depth_max]``.
Returns:
Depth map in the requested unit and dtype.
Raises:
ValueError: If ``use_log=True`` and ``depth_min + shift <= 0``.
ValueError: If ``output_unit`` is not ``\"m\"`` or ``\"mm\"``.
"""
if output_unit not in ("m", "mm"):
raise ValueError(f"output_unit must be 'm' or 'mm', got {output_unit!r}")
if isinstance(quantized, torch.Tensor):
quantized = quantized.detach().cpu().numpy()
q = np.asarray(quantized, dtype=np.uint16, order="K")
norm = q.astype(np.float32, copy=False) / DEPTH_QMAX
depth_min_mm = np.float32(depth_min * _MM_PER_METRE)
depth_max_mm = np.float32(depth_max * _MM_PER_METRE)
shift_mm = np.float32(shift * _MM_PER_METRE)
if use_log:
_validate_log_quant_params(depth_min, shift)
log_min = math.log(float(depth_min_mm + shift_mm))
log_max = math.log(float(depth_max_mm + shift_mm))
depth_mm = np.exp(norm * (log_max - log_min) + log_min) - shift_mm
else:
depth_mm = norm * (depth_max_mm - depth_min_mm) + depth_min_mm
depth_mm = np.clip(depth_mm, depth_min_mm, depth_max_mm).astype(np.float32, copy=False)
if output_unit == "m":
return (depth_mm / np.float32(_MM_PER_METRE)).astype(np.float32, copy=False)
mm = np.rint(depth_mm).clip(0, _UINT16_MAX)
return mm.astype(np.uint16, copy=False)
+7 -4
View File
@@ -19,6 +19,7 @@ from pprint import pformat
import torch
from lerobot.configs import PreTrainedConfig
from lerobot.configs.rewards import RewardModelConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.transforms import ImageTransforms
from lerobot.utils.constants import ACTION, IMAGENET_STATS, OBS_PREFIX, REWARD
@@ -30,12 +31,14 @@ from .streaming_dataset import StreamingLeRobotDataset
def resolve_delta_timestamps(
cfg: PreTrainedConfig, ds_meta: LeRobotDatasetMetadata
cfg: PreTrainedConfig | RewardModelConfig, ds_meta: LeRobotDatasetMetadata
) -> dict[str, list] | None:
"""Resolves delta_timestamps by reading from the 'delta_indices' properties of the PreTrainedConfig.
"""Resolves delta_timestamps by reading from the 'delta_indices' properties of the config.
Args:
cfg (PreTrainedConfig): The PreTrainedConfig to read delta_indices from.
cfg (PreTrainedConfig | RewardModelConfig): The config to read delta_indices from. Both
``PreTrainedConfig`` and concrete ``RewardModelConfig`` subclasses expose the
``{observation,action,reward}_delta_indices`` properties used below.
ds_meta (LeRobotDatasetMetadata): The dataset from which features and fps are used to build
delta_timestamps against.
@@ -82,7 +85,7 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
ds_meta = LeRobotDatasetMetadata(
cfg.dataset.repo_id, root=cfg.dataset.root, revision=cfg.dataset.revision
)
delta_timestamps = resolve_delta_timestamps(cfg.policy, ds_meta)
delta_timestamps = resolve_delta_timestamps(cfg.trainable_config, ds_meta)
if not cfg.dataset.streaming:
dataset = LeRobotDataset(
cfg.dataset.repo_id,
+32 -22
View File
@@ -28,6 +28,7 @@ from .utils import (
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
DatasetInfo,
)
@@ -78,8 +79,8 @@ def create_empty_dataset_info(
chunks_size: int | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
) -> dict:
"""Create a template dictionary for a new dataset's `info.json`.
) -> DatasetInfo:
"""Create a template ``DatasetInfo`` object for a new dataset's ``meta/info.json``.
Args:
codebase_version (str): The version of the LeRobot codebase.
@@ -87,25 +88,24 @@ def create_empty_dataset_info(
features (dict): The LeRobot features dictionary for the dataset.
use_videos (bool): Whether the dataset will store videos.
robot_type (str | None): The type of robot used, if any.
chunks_size (int | None): Max files per chunk directory. Defaults to ``DEFAULT_CHUNK_SIZE``.
data_files_size_in_mb (int | None): Max parquet file size in MB. Defaults to ``DEFAULT_DATA_FILE_SIZE_IN_MB``.
video_files_size_in_mb (int | None): Max video file size in MB. Defaults to ``DEFAULT_VIDEO_FILE_SIZE_IN_MB``.
Returns:
dict: A dictionary with the initial dataset metadata.
DatasetInfo: A typed dataset information object with initial metadata.
"""
return {
"codebase_version": codebase_version,
"robot_type": robot_type,
"total_episodes": 0,
"total_frames": 0,
"total_tasks": 0,
"chunks_size": chunks_size or DEFAULT_CHUNK_SIZE,
"data_files_size_in_mb": data_files_size_in_mb or DEFAULT_DATA_FILE_SIZE_IN_MB,
"video_files_size_in_mb": video_files_size_in_mb or DEFAULT_VIDEO_FILE_SIZE_IN_MB,
"fps": fps,
"splits": {},
"data_path": DEFAULT_DATA_PATH,
"video_path": DEFAULT_VIDEO_PATH if use_videos else None,
"features": features,
}
return DatasetInfo(
codebase_version=codebase_version,
fps=fps,
features=features,
robot_type=robot_type,
chunks_size=chunks_size or DEFAULT_CHUNK_SIZE,
data_files_size_in_mb=data_files_size_in_mb or DEFAULT_DATA_FILE_SIZE_IN_MB,
video_files_size_in_mb=video_files_size_in_mb or DEFAULT_VIDEO_FILE_SIZE_IN_MB,
data_path=DEFAULT_DATA_PATH,
video_path=DEFAULT_VIDEO_PATH if use_videos else None,
)
def check_delta_timestamps(
@@ -294,10 +294,20 @@ def validate_feature_image_or_video(
# Note: The check of pixels range ([0,1] for float and [0,255] for uint8) is done by the image writer threads.
error_message = ""
if isinstance(value, np.ndarray):
actual_shape = value.shape
c, h, w = expected_shape
if len(actual_shape) != 3 or (actual_shape != (c, h, w) and actual_shape != (h, w, c)):
error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(c, h, w)}' or '{(h, w, c)}'.\n"
actual_shape = tuple(value.shape)
expected = tuple(expected_shape)
if len(expected) == 2:
# Single-channel features (e.g. depth maps) — accept (H,W), (1,H,W), (H,W,1)
h, w = expected
valid = actual_shape in {(h, w), (1, h, w), (h, w, 1)}
if not valid:
error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(h, w)}', '{(1, h, w)}', or '{(h, w, 1)}'.\n"
elif len(expected) == 3:
c, h, w = expected
if len(actual_shape) != 3 or (actual_shape != (c, h, w) and actual_shape != (h, w, c)):
error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(c, h, w)}' or '{(h, w, c)}'.\n"
else:
error_message += f"The feature '{name}' has an unsupported expected_shape '{expected}'.\n"
elif isinstance(value, PILImage.Image):
pass
else:
+63 -7
View File
@@ -41,15 +41,56 @@ def safe_stop_image_writer(func):
return wrapper
def image_array_to_pil_image(image_array: np.ndarray, range_check: bool = True) -> PIL.Image.Image:
# TODO(aliberts): handle 1 channel and 4 for depth images
if image_array.ndim != 3:
raise ValueError(f"The array has {image_array.ndim} dimensions, but 3 is expected for an image.")
# Single-channel dtypes that PIL natively maps to the matching mode
# (``uint8`` → ``L``, ``uint16`` → ``I;16``, ``float32`` → ``F``).
GRAYSCALE_DTYPES: tuple[np.dtype, ...] = (
np.dtype("uint8"),
np.dtype("uint16"),
np.dtype("float32"),
)
def image_array_to_pil_image(image_array: np.ndarray, range_check: bool = True) -> PIL.Image.Image:
"""Convert a NumPy array to a PIL Image, preserving precision for grayscale.
Behaviour by shape:
- ``(H, W)`` or ``(1, H, W)`` / ``(H, W, 1)``: single-channel grayscale.
The native dtype is preserved using the matching PIL mode
(``L`` / ``I;16`` / ``F``). This is the path used for raw depth maps (no rescaling, clamping, or downcasting)
- ``(3, H, W)`` / ``(H, W, 3)``: RGB. Channels-first inputs are transposed
to channels-last. Float inputs in ``[0, 1]`` are scaled to ``uint8``
(existing behaviour, gated by ``range_check``).
Other shapes / channel counts raise ``NotImplementedError`` or
``ValueError``.
"""
if image_array.ndim not in (2, 3):
raise ValueError(
f"The array has {image_array.ndim} dimensions, but 2 or 3 is expected for an image."
)
# Squeeze 3D single-channel inputs to 2D so depth maps work whether the
# caller emits (H, W), (1, H, W), or (H, W, 1).
if image_array.ndim == 3:
if image_array.shape[0] == 1:
image_array = image_array[0]
elif image_array.shape[-1] == 1:
image_array = image_array[..., 0]
if image_array.ndim == 2:
if image_array.dtype not in GRAYSCALE_DTYPES:
raise ValueError(
f"Unsupported single-channel image dtype: {image_array.dtype}. "
f"Supported dtypes: {sorted(str(d) for d in GRAYSCALE_DTYPES)}."
)
return PIL.Image.fromarray(np.ascontiguousarray(image_array))
# 3D path: must be RGB (3 channels), channels-first or channels-last.
if image_array.shape[0] == 3:
# Transpose from pytorch convention (C, H, W) to (H, W, C)
image_array = image_array.transpose(1, 2, 0)
elif image_array.shape[-1] != 3:
raise NotImplementedError(
f"The image has {image_array.shape[-1]} channels, but 3 is required for now."
@@ -71,13 +112,28 @@ def image_array_to_pil_image(image_array: np.ndarray, range_check: bool = True)
return PIL.Image.fromarray(image_array)
def save_kwargs_for_path(fpath: Path, compress_level: int) -> dict:
"""Pick the right format-specific kwargs for :meth:`PIL.Image.Image.save`.
PNG uses ``compress_level`` (09, zlib). TIFF uses ``compression`` (raw) for lossless raw depth maps.
"""
suffix = Path(fpath).suffix.lower()
if suffix == ".png":
return {"compress_level": compress_level}
if suffix in (".tif", ".tiff"):
return {"compression": "raw"}
return {}
def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level: int = 1):
"""
Saves a NumPy array or PIL Image to a file.
This function handles both NumPy arrays and PIL Image objects, converting
the former to a PIL Image before saving. It includes error handling for
the save operation.
the save operation. The output format is inferred from the *fpath*
extension: ``.png`` → PNG with ``compress_level``, ``.tiff`` / ``.tif``
→ lossless raw depth maps (TIFF).
Args:
image (np.ndarray | PIL.Image.Image): The image data to save.
@@ -101,7 +157,7 @@ def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level
img = image
else:
raise TypeError(f"Unsupported image type: {type(image)}")
img.save(fpath, compress_level=compress_level)
img.save(fpath, **save_kwargs_for_path(Path(fpath), compress_level))
except Exception as e:
logger.error("Error writing image %s: %s", fpath, e)
+7 -10
View File
@@ -39,6 +39,7 @@ from .utils import (
EPISODES_DIR,
INFO_PATH,
STATS_PATH,
DatasetInfo,
serialize_dict,
)
@@ -115,25 +116,21 @@ def embed_images(dataset: datasets.Dataset) -> datasets.Dataset:
return dataset
def write_info(info: dict, local_dir: Path) -> None:
write_json(info, local_dir / INFO_PATH)
def write_info(info: DatasetInfo, local_dir: Path) -> None:
write_json(info.to_dict(), local_dir / INFO_PATH)
def load_info(local_dir: Path) -> dict:
def load_info(local_dir: Path) -> DatasetInfo:
"""Load dataset info metadata from its standard file path.
Also converts shape lists to tuples for consistency.
Args:
local_dir (Path): The root directory of the dataset.
Returns:
dict: The dataset information dictionary.
DatasetInfo: The typed dataset information object.
"""
info = load_json(local_dir / INFO_PATH)
for ft in info["features"].values():
ft["shape"] = tuple(ft["shape"])
return info
raw = load_json(local_dir / INFO_PATH)
return DatasetInfo.from_dict(raw)
def write_stats(stats: dict, local_dir: Path) -> None:
+87 -44
View File
@@ -35,9 +35,11 @@ from .utils import (
is_valid_version,
)
from .video_utils import (
DepthEncoderConfig,
StreamingVideoEncoder,
get_safe_default_codec,
resolve_vcodec,
VideoEncoderConfig,
get_safe_default_video_backend,
seed_depth_feature_info,
)
logger = logging.getLogger(__name__)
@@ -58,10 +60,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: str | None = None,
return_uint8: bool = False,
batch_encoding_size: int = 1,
vcodec: str = "libsvtav1",
camera_encoder_config: VideoEncoderConfig | None = None,
depth_encoder_config: DepthEncoderConfig | None = None,
encoder_threads: int | None = None,
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
encoder_threads: int | None = None,
):
"""
2 modes are available for instantiating this class, depending on 2 different use cases:
@@ -177,16 +180,15 @@ class LeRobotDataset(torch.utils.data.Dataset):
You can also use the 'pyav' decoder used by Torchvision, which used to be the default option, or 'video_reader' which is another decoder of Torchvision.
batch_encoding_size (int, optional): Number of episodes to accumulate before batch encoding videos.
Set to 1 for immediate encoding (default), or higher for batched encoding. Defaults to 1.
vcodec (str, optional): Video codec for encoding videos during recording. Options: 'h264', 'hevc',
'libsvtav1', 'auto', or hardware-specific codecs like 'h264_videotoolbox', 'h264_nvenc'.
Defaults to 'libsvtav1'. Use 'auto' to auto-detect the best available hardware encoder.
camera_encoder_config (VideoEncoderConfig | None, optional): Video encoder settings for cameras
(codec, quality, etc.). Defaults to
:class:`~lerobot.datasets.video_utils.VideoEncoderConfig` defaults when ``None``.
encoder_threads (int | None, optional): Number of encoder threads (global). ``None`` lets the
codec decide.
streaming_encoding (bool, optional): If True, encode video frames in real-time during capture
instead of writing PNG images first. This makes save_episode() near-instant. Defaults to False.
encoder_queue_maxsize (int, optional): Maximum number of frames to buffer per camera when using
streaming encoding. Defaults to 30 (~1s at 30fps).
encoder_threads (int | None, optional): Number of threads per encoder instance. None lets the
codec auto-detect (default). Lower values reduce CPU usage per encoder. Maps to 'lp' (via svtav1-params) for
libsvtav1 and 'threads' for h264/hevc.
Note:
Write-mode parameters (``streaming_encoding``, ``batch_encoding_size``) passed to
@@ -202,10 +204,13 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.episodes = episodes
self.tolerance_s = tolerance_s
self.revision = revision if revision else CODEBASE_VERSION
self._video_backend = video_backend if video_backend else get_safe_default_codec()
self._video_backend = video_backend if video_backend else get_safe_default_video_backend()
self._return_uint8 = return_uint8
self._batch_encoding_size = batch_encoding_size
self._vcodec = resolve_vcodec(vcodec)
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
self._camera_encoder_config = camera_encoder_config
self._depth_encoder_config = depth_encoder_config
self._encoder_threads = encoder_threads
if self._requested_root is not None:
@@ -248,16 +253,23 @@ class LeRobotDataset(torch.utils.data.Dataset):
DeprecationWarning,
stacklevel=2,
)
seed_depth_feature_info(self.meta.features, self._depth_encoder_config)
streaming_enc = None
if streaming_encoding and len(self.meta.video_keys) > 0:
streaming_enc = self._build_streaming_encoder(
self.meta.fps, self._vcodec, encoder_queue_maxsize, encoder_threads
self.meta.fps,
self._camera_encoder_config,
self._encoder_threads,
encoder_queue_maxsize,
depth_encoder_config=self._depth_encoder_config,
depth_keys=self.meta.depth_keys,
)
self.writer = DatasetWriter(
meta=self.meta,
root=self.root,
vcodec=self._vcodec,
encoder_threads=encoder_threads,
camera_encoder_config=self._camera_encoder_config,
depth_encoder_config=self._depth_encoder_config,
encoder_threads=self._encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
initial_frames=self.meta.total_frames,
@@ -298,19 +310,20 @@ class LeRobotDataset(torch.utils.data.Dataset):
@staticmethod
def _build_streaming_encoder(
fps: int,
vcodec: str,
encoder_queue_maxsize: int,
camera_encoder_config: VideoEncoderConfig,
encoder_threads: int | None,
encoder_queue_maxsize: int,
*,
depth_encoder_config: DepthEncoderConfig | None = None,
depth_keys: list[str] | None = None,
) -> StreamingVideoEncoder:
return StreamingVideoEncoder(
fps=fps,
vcodec=vcodec,
pix_fmt="yuv420p",
g=2,
crf=30,
preset=None,
queue_maxsize=encoder_queue_maxsize,
camera_encoder_config=camera_encoder_config,
encoder_threads=encoder_threads,
queue_maxsize=encoder_queue_maxsize,
depth_encoder_config=depth_encoder_config,
depth_keys=depth_keys,
)
# ── Metadata properties ───────────────────────────────────────────
@@ -625,11 +638,14 @@ class LeRobotDataset(torch.utils.data.Dataset):
image_writer_threads: int = 0,
video_backend: str | None = None,
batch_encoding_size: int = 1,
vcodec: str = "libsvtav1",
camera_encoder_config: VideoEncoderConfig | None = None,
depth_encoder_config: DepthEncoderConfig | None = None,
metadata_buffer_size: int = 10,
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
encoder_threads: int | None = None,
video_files_size_in_mb: int | None = None,
data_files_size_in_mb: int | None = None,
) -> "LeRobotDataset":
"""Create a new LeRobotDataset from scratch for recording data.
@@ -654,20 +670,23 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: Video decoding backend (used when reading back).
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos. ``1`` means encode immediately.
vcodec: Video codec for encoding. Options include ``'libsvtav1'``,
``'h264'``, ``'hevc'``, ``'auto'``.
camera_encoder_config: Video encoder settings for cameras; defaults
match :class:`~lerobot.datasets.video_utils.VideoEncoderConfig`
when ``None``.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
metadata_buffer_size: Number of episode metadata records to buffer
before flushing to parquet.
streaming_encoding: If ``True``, encode video frames in real-time
during capture instead of writing images first.
encoder_queue_maxsize: Max buffered frames per camera when using
streaming encoding.
encoder_threads: Threads per encoder instance. ``None`` for auto.
Returns:
A new :class:`LeRobotDataset` in write mode.
"""
vcodec = resolve_vcodec(vcodec)
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
obj = cls.__new__(cls)
obj.meta = LeRobotDatasetMetadata.create(
repo_id=repo_id,
@@ -677,6 +696,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
root=root,
use_videos=use_videos,
metadata_buffer_size=metadata_buffer_size,
video_files_size_in_mb=video_files_size_in_mb,
data_files_size_in_mb=data_files_size_in_mb,
)
obj.repo_id = obj.meta.repo_id
obj._requested_root = obj.meta.root
@@ -686,23 +707,32 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend is not None else get_safe_default_codec()
obj._video_backend = video_backend if video_backend is not None else get_safe_default_video_backend()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._camera_encoder_config = camera_encoder_config
obj._depth_encoder_config = depth_encoder_config
obj._encoder_threads = encoder_threads
seed_depth_feature_info(obj.meta.features, depth_encoder_config)
# Reader is lazily created on first access (write-only mode)
obj.reader = None
# Create writer
streaming_enc = None
if streaming_encoding and len(obj.meta.video_keys) > 0:
streaming_enc = cls._build_streaming_encoder(fps, vcodec, encoder_queue_maxsize, encoder_threads)
streaming_enc = cls._build_streaming_encoder(
fps,
camera_encoder_config,
encoder_threads,
encoder_queue_maxsize,
depth_encoder_config=depth_encoder_config,
depth_keys=obj.meta.depth_keys,
)
obj.writer = DatasetWriter(
meta=obj.meta,
root=obj.root,
vcodec=vcodec,
camera_encoder_config=camera_encoder_config,
depth_encoder_config=depth_encoder_config,
encoder_threads=encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
@@ -725,12 +755,13 @@ class LeRobotDataset(torch.utils.data.Dataset):
force_cache_sync: bool = False,
video_backend: str | None = None,
batch_encoding_size: int = 1,
vcodec: str = "libsvtav1",
camera_encoder_config: VideoEncoderConfig | None = None,
depth_encoder_config: DepthEncoderConfig | None = None,
encoder_threads: int | None = None,
image_writer_processes: int = 0,
image_writer_threads: int = 0,
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
encoder_threads: int | None = None,
) -> "LeRobotDataset":
"""Resume recording on an existing dataset.
@@ -753,13 +784,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: Video decoding backend for reading back data.
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos.
vcodec: Video codec for encoding.
camera_encoder_config: Video encoder settings for cameras; defaults
match :class:`~lerobot.datasets.video_utils.VideoEncoderConfig`
when ``None``.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
image_writer_processes: Subprocesses for async image writing.
image_writer_threads: Threads for async image writing.
streaming_encoding: If ``True``, encode video in real-time during
capture.
encoder_queue_maxsize: Max buffered frames per camera for streaming.
encoder_threads: Threads per encoder instance. ``None`` for auto.
Returns:
A :class:`LeRobotDataset` in write mode, ready to append episodes.
@@ -770,7 +804,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
"Writing into the revision-safe Hub snapshot cache (used when root=None) would corrupt "
"the shared cache. Please provide a local directory path."
)
vcodec = resolve_vcodec(vcodec)
obj = cls.__new__(cls)
obj.repo_id = repo_id
obj._requested_root = Path(root)
@@ -779,11 +812,9 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend else get_safe_default_codec()
obj._video_backend = video_backend if video_backend else get_safe_default_video_backend()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._encoder_threads = encoder_threads
if obj._requested_root is not None:
obj._requested_root.mkdir(exist_ok=True, parents=True)
@@ -792,21 +823,33 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.meta = LeRobotDatasetMetadata(
obj.repo_id, obj._requested_root, obj.revision, force_cache_sync=force_cache_sync
)
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
obj._camera_encoder_config = camera_encoder_config
obj._depth_encoder_config = depth_encoder_config
obj._encoder_threads = encoder_threads
obj.root = obj.meta.root
seed_depth_feature_info(obj.meta.features, depth_encoder_config)
# Reader is lazily created on first access (write-only mode)
obj.reader = None
# Create writer for appending
streaming_enc = None
if streaming_encoding and len(obj.meta.video_keys) > 0:
streaming_enc = cls._build_streaming_encoder(
obj.meta.fps, vcodec, encoder_queue_maxsize, encoder_threads
obj.meta.fps,
camera_encoder_config,
encoder_threads,
encoder_queue_maxsize,
depth_encoder_config=depth_encoder_config,
depth_keys=obj.meta.depth_keys,
)
obj.writer = DatasetWriter(
meta=obj.meta,
root=obj.root,
vcodec=vcodec,
camera_encoder_config=camera_encoder_config,
depth_encoder_config=depth_encoder_config,
encoder_threads=encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
+2 -2
View File
@@ -123,7 +123,7 @@ class MultiLeRobotDataset(torch.utils.data.Dataset):
NOTE: Fow now, this relies on a check in __init__ to make sure all sub-datasets have the same info.
"""
return self._datasets[0].meta.info["fps"]
return self._datasets[0].meta.info.fps
@property
def video(self) -> bool:
@@ -133,7 +133,7 @@ class MultiLeRobotDataset(torch.utils.data.Dataset):
NOTE: Fow now, this relies on a check in __init__ to make sure all sub-datasets have the same info.
"""
return self._datasets[0].meta.info.get("video", False)
return len(self._datasets[0].meta.video_keys) > 0
@property
def features(self) -> datasets.Features:
+311
View File
@@ -0,0 +1,311 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyAV-based compatibility checks for :class:`VideoEncoderConfig`.
Centralises all :mod:`av` introspection of the bundled FFmpeg build.
Checks degrade to a no-op when the target codec isn't available locally.
"""
from __future__ import annotations
import functools
import logging
from typing import TYPE_CHECKING, Any, Literal
import av
import numpy as np
import torch
from lerobot.datasets.depth_utils import (
DEFAULT_DEPTH_MAX,
DEFAULT_DEPTH_MIN,
DEFAULT_DEPTH_SHIFT,
DEFAULT_DEPTH_USE_LOG,
quantize_depth,
dequantize_depth,
)
if TYPE_CHECKING:
from lerobot.datasets.video_utils import VideoEncoderConfig
logger = logging.getLogger(__name__)
# Pixel formats supported by the depth encode/decode helpers below. Both are
# 16-bit-word formats that carry 12 significant bits per sample, matching the
# ``DEPTH_QMAX = 4095`` quantization range.
DEPTH_PIX_FMTS: tuple[str, ...] = ("yuv420p12le", "gray12le")
# Neutral chroma for 12-bit YUV (the midpoint of [0, 4095]). Filling the U/V
# planes with this value keeps the encoder from spending bits on chroma noise
# when only the Y plane carries information.
_NEUTRAL_CHROMA_12BIT: int = 2048
FFMPEG_NUMERIC_OPTION_TYPES = ("INT", "INT64", "UINT64", "FLOAT", "DOUBLE")
FFMPEG_INTEGER_OPTION_TYPES = ("INT", "INT64", "UINT64")
def _write_u16_plane(plane: av.video.plane.VideoPlane, src: np.ndarray, fill_value: int | None = None) -> None:
"""Copy ``src`` into a uint16 plane respecting FFmpeg line padding."""
height, width = src.shape
stride_u16 = plane.line_size // np.dtype(np.uint16).itemsize
dst = np.frombuffer(plane, dtype=np.uint16).reshape(height, stride_u16)
if fill_value is not None:
dst.fill(fill_value)
dst[:, :width] = src
def encode_depth_frame_pyav(
depth: np.ndarray | torch.Tensor,
*,
pix_fmt: str = "yuv420p12le",
depth_min: float = DEFAULT_DEPTH_MIN,
depth_max: float = DEFAULT_DEPTH_MAX,
shift: float = DEFAULT_DEPTH_SHIFT,
use_log: bool = DEFAULT_DEPTH_USE_LOG,
input_unit: Literal["auto", "m", "mm"] = "auto",
) -> av.VideoFrame:
"""Quantize depth and pack it into a 12-bit PyAV video frame.
Args:
depth: Depth frame to encode (H, W). Unit handling follows
:func:`lerobot.datasets.depth_utils.quantize_depth`.
pix_fmt: Target pixel format. Must be one of :data:`DEPTH_PIX_FMTS`.
depth_min, depth_max, shift, use_log, input_unit: Forwarded to
:func:`quantize_depth`.
Returns:
An :class:`av.VideoFrame` in ``pix_fmt`` with quantized depth in the
luminance plane.
"""
if pix_fmt not in DEPTH_PIX_FMTS:
raise ValueError(f"Unsupported depth pix_fmt={pix_fmt!r}; expected one of {DEPTH_PIX_FMTS}")
quantized_depth = quantize_depth(
depth,
depth_min=depth_min,
depth_max=depth_max,
shift=shift,
use_log=use_log,
input_unit=input_unit,
)
if quantized_depth.ndim != 2:
raise ValueError(f"depth must be a 2D frame; got shape {quantized_depth.shape}")
quantized_depth = np.ascontiguousarray(quantized_depth, dtype=np.uint16)
height, width = quantized_depth.shape
if pix_fmt == "gray12le":
frame = av.VideoFrame(width=width, height=height, format="gray12le")
_write_u16_plane(frame.planes[0], quantized_depth)
return frame
if height % 2 != 0 or width % 2 != 0:
raise ValueError("yuv420p12le requires even H and W")
frame = av.VideoFrame(width=width, height=height, format="yuv420p12le")
_write_u16_plane(frame.planes[0], quantized_depth)
neutral_chroma = np.full((height // 2, width // 2), _NEUTRAL_CHROMA_12BIT, dtype=np.uint16)
_write_u16_plane(frame.planes[1], neutral_chroma, fill_value=_NEUTRAL_CHROMA_12BIT)
_write_u16_plane(frame.planes[2], neutral_chroma, fill_value=_NEUTRAL_CHROMA_12BIT)
return frame
def decode_depth_frame_pyav(
frame: av.VideoFrame | list[av.VideoFrame],
*,
depth_min: float = DEFAULT_DEPTH_MIN,
depth_max: float = DEFAULT_DEPTH_MAX,
shift: float = DEFAULT_DEPTH_SHIFT,
use_log: bool = DEFAULT_DEPTH_USE_LOG,
return_quantized: bool = False,
output_unit: Literal["m", "mm"] = "m",
) -> np.ndarray:
"""Decode one or many depth video frames to quantized or metric depth.
Args:
frame: A single depth frame or a list of depth frames.
depth_min, depth_max, shift, use_log: Forwarded to
:func:`dequantize_depth`.
return_quantized: If ``True``, return raw 12-bit quanta as ``uint16``.
output_unit: Unit for dequantized output (``"m"`` or ``"mm"``).
Returns:
``(H, W)`` array for a single frame, or ``(N, H, W)`` for a list.
"""
frames = frame if isinstance(frame, list) else [frame]
quantized = np.stack([f.reformat(format="gray12le").to_ndarray() for f in frames]).astype(np.uint16, copy=False)
if return_quantized:
return quantized[0] if len(frames) == 1 else quantized
decoded = dequantize_depth(
quantized,
depth_min=depth_min,
depth_max=depth_max,
shift=shift,
use_log=use_log,
output_unit=output_unit,
)
return decoded[0] if len(frames) == 1 else decoded
@functools.cache
def get_codec(vcodec: str) -> av.codec.Codec | None:
"""PyAV write-mode ``Codec`` for *vcodec*, or ``None`` if unavailable."""
try:
return av.codec.Codec(vcodec, "w")
except Exception:
return None
@functools.cache
def _get_codec_video_formats(vcodec: str) -> dict[str, av.option.Option]:
"""Private-option name → PyAV ``Option`` for *vcodec* (empty if unavailable)."""
codec = get_codec(vcodec)
if codec is None:
return {}
return {opt.name: opt for opt in codec.descriptor.options}
@functools.cache
def _get_codec_video_formats(vcodec: str) -> tuple[str, ...]:
"""Pixel formats accepted by *vcodec* in PyAV's preferred order (empty if unknown)."""
codec = get_codec(vcodec)
if codec is None:
return ()
return tuple(fmt.name for fmt in (codec.video_formats or []))
def detect_available_encoders_pyav(encoders: list[str] | str) -> list[str]:
"""Return the subset of *encoders* available as video encoders in the local FFmpeg build.
Each name is probed directly via :func:`get_codec`; input order is preserved.
"""
if isinstance(encoders, str):
encoders = [encoders]
available: list[str] = []
for name in encoders:
codec = get_codec(name)
if codec is not None and codec.type == "video":
available.append(name)
else:
logger.debug("encoder '%s' not available as video encoder", name)
return available
def _check_option_value(vcodec: str, label: str, value: Any, opt: av.option.Option) -> None:
"""Range-check numeric *value* and choice-check string *value* against *opt*."""
type_name = opt.type.name
if type_name in FFMPEG_NUMERIC_OPTION_TYPES:
if isinstance(value, bool):
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
)
elif isinstance(value, str):
try:
num_val = float(value)
except ValueError as e:
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
) from e
elif isinstance(value, (float, int)):
num_val = value
else:
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
)
# Check integer type compatibility
if type_name in FFMPEG_INTEGER_OPTION_TYPES and not num_val.is_integer():
raise ValueError(
f"{label}={num_val!r} must be an integer for codec {vcodec!r} "
f"(FFmpeg option {opt.name!r} is {type_name}); float values are not allowed."
)
# Check numeric range compatibility
lo, hi = float(opt.min), float(opt.max)
if lo < hi and not (lo <= num_val <= hi):
raise ValueError(
f"{label}={num_val} is out of range for codec {vcodec!r}; must be in [{lo}, {hi}]"
)
elif type_name == "STRING":
if isinstance(value, bool):
raise ValueError(f"{label}={value!r} is not a valid string value for codec {vcodec!r}.")
if isinstance(value, str):
str_val = value
elif isinstance(value, (int, float)):
str_val = str(value)
else:
raise ValueError(f"{label}={value!r} has unsupported type for STRING option on codec {vcodec!r}")
# Check string choice compatibility
choices = [c.name for c in (opt.choices or [])]
if choices and str_val not in choices:
raise ValueError(
f"{label}={str_val!r} is not a supported choice for codec "
f"{vcodec!r}; valid choices: {choices}"
)
else:
return
def _check_pixel_format(vcodec: str, pix_fmt: str) -> None:
formats = _get_codec_video_formats(vcodec)
if formats and pix_fmt not in formats:
raise ValueError(
f"pix_fmt={pix_fmt!r} is not supported by codec {vcodec!r}; "
f"supported pixel formats: {list(formats)}"
)
def _check_codec_options(vcodec: str, codec_options: dict[str, Any], config: VideoEncoderConfig) -> None:
"""Validate merged encoder options (typed) against the codec's published AVOptions."""
supported_options = _get_codec_options_by_name(vcodec)
for key, value in codec_options.items():
# GOP size is not a codec-specific option, it has to be validated separately.
if key == "g":
if isinstance(value, bool) or not isinstance(value, int) or value < 1:
raise ValueError(f"g={value!r} must be a positive integer for codec {vcodec!r}")
continue
if key not in supported_options:
continue
opt = supported_options[key]
label = f"extra_options[{key!r}]" if key in config.extra_options else key
_check_option_value(vcodec, label, value, opt)
def check_video_encoder_config_pyav(config: VideoEncoderConfig) -> None:
"""Verify *config* is compatible with the bundled FFmpeg build.
Checks pixel format, abstract tuning-field compatibility, and each merged
encoder option from :meth:`~lerobot.datasets.video_utils.VideoEncoderConfig.get_codec_options`
against PyAV (including numeric ``extra_options`` present in that dict).
No-op when ``config.vcodec`` isn't in the local FFmpeg build.
Raises:
ValueError: on the first incompatibility encountered.
"""
vcodec = config.vcodec
options = _get_codec_options_by_name(vcodec)
if not options:
logger.warning(
"Codec %r is not available in the bundled FFmpeg build; ",
vcodec,
)
return
_check_pixel_format(config.vcodec, config.pix_fmt)
_check_codec_options(config.vcodec, config.get_codec_options(), config)
+1 -1
View File
@@ -434,7 +434,7 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
def _make_padding_camera_frame(self, camera_key: str):
"""Variable-shape padding frame for given camera keys, given in (H, W, C)"""
return torch.zeros(self.meta.info["features"][camera_key]["shape"]).permute(-1, 0, 1)
return torch.zeros(self.meta.info.features[camera_key]["shape"]).permute(-1, 0, 1)
def _get_video_frame_padding_mask(
self,
+131 -5
View File
@@ -14,9 +14,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import dataclasses
import importlib.resources
import json
import logging
from dataclasses import dataclass, field
from pathlib import Path
import datasets
@@ -70,9 +72,12 @@ class ForwardCompatibilityError(CompatibilityError):
super().__init__(message)
logger = logging.getLogger(__name__)
DEFAULT_CHUNK_SIZE = 1000 # Max number of files per chunk
DEFAULT_DATA_FILE_SIZE_IN_MB = 50 # Max size per file
DEFAULT_VIDEO_FILE_SIZE_IN_MB = 100 # Max size per file
DEFAULT_DATA_FILE_SIZE_IN_MB = 100 # Max size per file
DEFAULT_VIDEO_FILE_SIZE_IN_MB = 200 # Max size per file
INFO_PATH = "meta/info.json"
STATS_PATH = "meta/stats.json"
@@ -88,12 +93,133 @@ DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
DEFAULT_IMAGE_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.png"
# Depth maps live alongside images on disk but use TIFF instead of PNG: PNG
# cannot natively round-trip float32, and several common loaders silently
# downcast 16-bit grayscale.
DEFAULT_DEPTH_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.tiff"
LEGACY_EPISODES_PATH = "meta/episodes.jsonl"
LEGACY_EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
LEGACY_TASKS_PATH = "meta/tasks.jsonl"
@dataclass
class DatasetInfo:
"""Typed representation of the ``meta/info.json`` file for a LeRobot dataset.
Replaces the previously untyped ``dict`` returned by ``load_info()`` and
created by ``create_empty_dataset_info()``. Using a dataclass provides
explicit field definitions, IDE auto-completion, and validation at
construction time.
"""
codebase_version: str
fps: int
features: dict[str, dict]
# Episode / frame counters — start at zero for new datasets
total_episodes: int = 0
total_frames: int = 0
total_tasks: int = 0
# Storage settings
chunks_size: int = field(default=DEFAULT_CHUNK_SIZE)
data_files_size_in_mb: int = field(default=DEFAULT_DATA_FILE_SIZE_IN_MB)
video_files_size_in_mb: int = field(default=DEFAULT_VIDEO_FILE_SIZE_IN_MB)
# File path templates
data_path: str = field(default=DEFAULT_DATA_PATH)
video_path: str | None = field(default=DEFAULT_VIDEO_PATH)
# Optional metadata
robot_type: str | None = None
splits: dict[str, str] = field(default_factory=dict)
def __post_init__(self) -> None:
# Coerce feature shapes from list to tuple — JSON deserialisation
# returns lists, but the rest of the codebase expects tuples.
for ft in self.features.values():
if isinstance(ft.get("shape"), list):
ft["shape"] = tuple(ft["shape"])
if self.fps <= 0:
raise ValueError(f"fps must be positive, got {self.fps}")
if self.chunks_size <= 0:
raise ValueError(f"chunks_size must be positive, got {self.chunks_size}")
if self.data_files_size_in_mb <= 0:
raise ValueError(f"data_files_size_in_mb must be positive, got {self.data_files_size_in_mb}")
if self.video_files_size_in_mb <= 0:
raise ValueError(f"video_files_size_in_mb must be positive, got {self.video_files_size_in_mb}")
def to_dict(self) -> dict:
"""Return a JSON-serialisable dict.
Converts tuple shapes back to lists so ``json.dump`` can handle them.
"""
d = dataclasses.asdict(self)
for ft in d["features"].values():
if isinstance(ft.get("shape"), tuple):
ft["shape"] = list(ft["shape"])
return d
@classmethod
def from_dict(cls, data: dict) -> "DatasetInfo":
"""Construct from a raw dict (e.g. loaded directly from JSON).
Unknown keys are ignored for forward compatibility with datasets that
carry additional fields (e.g. ``total_videos`` from v2.x). A warning is
logged when such fields are present.
"""
known = {f.name for f in dataclasses.fields(cls)}
unknown = sorted(k for k in data if k not in known)
if unknown:
logger.warning(f"Unknown fields in DatasetInfo: {unknown}. These will be ignored.")
return cls(**{k: v for k, v in data.items() if k in known})
# ---------------------------------------------------------------------------
# Temporary dict-style compatibility layer
# Allows existing ``info["key"]`` call-sites to keep working without changes.
# Once all callers have been migrated to attribute access, remove these.
# ---------------------------------------------------------------------------
def __getitem__(self, key: str):
import warnings
warnings.warn(
f"Accessing DatasetInfo with dict-style syntax info['{key}'] is deprecated. "
f"Use attribute access info.{key} instead.",
DeprecationWarning,
stacklevel=2,
)
try:
return getattr(self, key)
except AttributeError as err:
raise KeyError(key) from err
def __setitem__(self, key: str, value) -> None:
import warnings
warnings.warn(
f"Setting DatasetInfo with dict-style syntax info['{key}'] = ... is deprecated. "
f"Use attribute assignment info.{key} = ... instead.",
DeprecationWarning,
stacklevel=2,
)
if not hasattr(self, key):
raise KeyError(f"DatasetInfo has no field '{key}'")
setattr(self, key, value)
def __contains__(self, key: str) -> bool:
"""Check if a field exists (dict-like interface)."""
return hasattr(self, key)
def get(self, key: str, default=None):
"""Get attribute value with default fallback (dict-like interface)."""
try:
return getattr(self, key)
except AttributeError:
return default
def has_legacy_hub_download_metadata(root: Path) -> bool:
"""Return ``True`` when *root* looks like a legacy Hub ``local_dir`` mirror.
@@ -294,7 +420,7 @@ def create_branch(repo_id: str, *, branch: str, repo_type: str | None = None) ->
def create_lerobot_dataset_card(
tags: list | None = None,
dataset_info: dict | None = None,
dataset_info: DatasetInfo | None = None,
**kwargs,
) -> DatasetCard:
"""Create a `DatasetCard` for a LeRobot dataset.
@@ -305,7 +431,7 @@ def create_lerobot_dataset_card(
Args:
tags (list | None): A list of tags to add to the dataset card.
dataset_info (dict | None): The dataset's info dictionary, which will
dataset_info (DatasetInfo | None): The dataset's info object, which will
be displayed on the card.
**kwargs: Additional keyword arguments to populate the card template.
@@ -318,7 +444,7 @@ def create_lerobot_dataset_card(
card_tags += tags
if dataset_info:
dataset_structure = "[meta/info.json](meta/info.json):\n"
dataset_structure += f"```json\n{json.dumps(dataset_info, indent=4)}\n```\n"
dataset_structure += f"```json\n{json.dumps(dataset_info.to_dict(), indent=4)}\n```\n"
kwargs = {**kwargs, "dataset_structure": dataset_structure}
card_data = DatasetCardData(
license=kwargs.get("license"),
+517 -145
View File
@@ -17,12 +17,13 @@ import contextlib
import glob
import importlib
import logging
import math
import queue
import shutil
import tempfile
import threading
import warnings
from dataclasses import dataclass, field
from dataclasses import asdict, dataclass, field
from fractions import Fraction
from pathlib import Path
from threading import Lock
@@ -37,7 +38,23 @@ import torchvision
from datasets.features.features import register_feature
from PIL import Image
from lerobot.utils.import_utils import get_safe_default_codec
from lerobot.datasets.pyav_utils import (
check_video_encoder_config_pyav,
depth_to_video_frame,
detect_available_encoders_pyav,
decode_depth_frame,
encode_depth_frame_pyav,
decode_depth_frame_pyav,
)
from lerobot.datasets.depth_utils import (
quantize_depth,
dequantize_depth,
DEFAULT_DEPTH_MIN,
DEFAULT_DEPTH_MAX,
DEFAULT_DEPTH_SHIFT,
DEFAULT_DEPTH_USE_LOG,
)
from lerobot.utils.import_utils import get_safe_default_video_backend
logger = logging.getLogger(__name__)
@@ -52,70 +69,226 @@ HW_ENCODERS = [
"h264_qsv", # Intel Quick Sync
]
VALID_VIDEO_CODECS = {"h264", "hevc", "libsvtav1", "auto"} | set(HW_ENCODERS)
VALID_VIDEO_CODECS = {"h264", "hevc", "libsvtav1", "ffv1", "auto"} | set(HW_ENCODERS)
LIBSVTAV1_DEFAULT_PRESET: int = 12
def _get_codec_options(
vcodec: str,
g: int | None = 2,
crf: int | None = 30,
preset: int | None = None,
) -> dict:
"""Build codec-specific options dict for video encoding."""
options = {}
@dataclass
class VideoEncoderConfig:
"""Video encoder configuration.
# GOP size (keyframe interval) - supported by VideoToolbox and software encoders
if g is not None and (vcodec in ("h264_videotoolbox", "hevc_videotoolbox") or vcodec not in HW_ENCODERS):
options["g"] = str(g)
Attributes:
vcodec: FFmpeg encoder name. ``"auto"`` is resolved during
construction (HW encoder if available, else ``libsvtav1``).
pix_fmt: Pixel format (e.g. ``"yuv420p"``).
g: GOP size (keyframe interval).
crf: Quality level — mapped to the native quality parameter of the
codec (``crf`` for software, ``qp`` for NVENC/VAAPI,
``q:v`` for VideoToolbox, ``global_quality`` for QSV).
preset: Speed/quality preset. Accepted type is per-codec.
fast_decode: Fast-decode tuning. For ``libsvtav1`` this is a level (0-2)
embedded in ``svtav1-params``. For ``h264`` and ``hevc`` non-zero values
set ``tune=fastdecode``. Ignored for other codecs.
video_backend: Python library driving FFmpeg for encoding. Only ``"pyav"``
is currently supported.
extra_options: Free-form dictionary of additional FFmpeg options
(e.g. ``{"tune": "film", "profile:v": "high", "bf": 2}``).
"""
# Quality control (codec-specific parameter names)
if crf is not None:
if vcodec in ("h264", "hevc", "libsvtav1"):
options["crf"] = str(crf)
elif vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
quality = max(1, min(100, int(100 - crf * 2)))
options["q:v"] = str(quality)
elif vcodec in ("h264_nvenc", "hevc_nvenc"):
options["rc"] = "constqp"
options["qp"] = str(crf)
elif vcodec in ("h264_vaapi",):
options["qp"] = str(crf)
elif vcodec in ("h264_qsv",):
options["global_quality"] = str(crf)
vcodec: str = "libsvtav1"
pix_fmt: str = "yuv420p"
g: int | None = 2
crf: int | None = 30
preset: int | str | None = None
fast_decode: int = 0
# TODO(CarolinePascal): add torchcodec support + find a way to unify the
# two backends (encoding and decoding).
video_backend: str = "pyav"
extra_options: dict[str, Any] = field(default_factory=dict)
# Preset (only for libsvtav1)
if vcodec == "libsvtav1":
options["preset"] = str(preset) if preset is not None else "12"
# Class-level marker persisted to ``info.json`` (via ``asdict``) so the
# reader can tell depth datasets from RGB ones without a separate dispatch
# path. ``init=False`` keeps it out of CLI/constructor surface; subclasses
# flip the default (see :class:`DepthEncoderConfig`).
is_depth_map: bool = field(default=False, init=False)
return options
def __post_init__(self) -> None:
self.resolve_vcodec()
# Empty-constructor ergonomics: ``VideoEncoderConfig()`` must "just work".
if self.preset is None and self.vcodec == "libsvtav1":
self.preset = LIBSVTAV1_DEFAULT_PRESET
self.validate()
def detect_available_encoders(self, encoders: list[str] | str) -> list[str]:
"""Detect available encoders based on the video backend."""
if self.video_backend == "pyav":
return detect_available_encoders_pyav(encoders)
else:
return []
def validate(self) -> None:
"""Validate the video encoder config."""
if self.video_backend == "pyav":
check_video_encoder_config_pyav(self)
def resolve_vcodec(self) -> None:
"""Validate vcodec and resolve 'auto' to best available HW encoder, fallback to libsvtav1.
Any explicitly-requested codec that isn't in the local FFmpeg build is
also silently rewritten to ``libsvtav1`` so encoding never hard-fails on
a host missing the requested encoder.
"""
# Backward compatibility: older datasets persist ``vcodec="av1"`` in
# ``info.json``. Rewrite to the canonical encoder name *before* the
# validation check below so loading those datasets keeps working.
if self.vcodec == "av1":
self.vcodec = "libsvtav1"
if self.vcodec not in VALID_VIDEO_CODECS:
raise ValueError(f"Invalid vcodec '{self.vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
if self.vcodec == "auto":
available = self.detect_available_encoders(HW_ENCODERS)
for encoder in HW_ENCODERS:
if encoder in available:
logger.info(f"Auto-selected video codec: {encoder}")
self.vcodec = encoder
return
logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
self.vcodec = "libsvtav1"
if self.detect_available_encoders(self.vcodec):
logger.info(f"Using video codec: {self.vcodec}")
self.vcodec = self.vcodec
return
raise ValueError(f"Unsupported video codec: {self.vcodec} with video backend {self.video_backend}")
def get_codec_options(
self, encoder_threads: int | None = None, as_strings: bool = False
) -> dict[str, str]:
"""Translate the tuning fields to codec-specific FFmpeg options.
``VideoEncoderConfig.extra_options`` are merged last but never override a structured field.
Args:
encoder_threads: Number of encoder threads set globally for all VideoEncoderConfigs.
For libsvtav1, this is mapped to ``lp`` via ``svtav1-params``.
For h264/hevc, this is mapped to ``threads``.
Hardware encoders ignore this parameter.
as_strings: If ``True``, casts values to strings.
"""
opts: dict[str, Any] = {}
def set_if(key: str, value: Any) -> None:
if value is not None:
opts[key] = value if not as_strings else str(value)
# GOP size is not a codec-specific option, so it is always set.
set_if("g", self.g)
if self.vcodec == "libsvtav1":
set_if("crf", self.crf)
set_if("preset", self.preset)
svtav1_parts: list[str] = []
if self.fast_decode is not None:
svtav1_parts.append(f"fast-decode={max(0, min(2, self.fast_decode))}")
if encoder_threads is not None:
svtav1_parts.append(f"lp={encoder_threads}")
if svtav1_parts:
opts["svtav1-params"] = ":".join(svtav1_parts)
elif self.vcodec in ("h264", "hevc"):
set_if("crf", self.crf)
set_if("preset", self.preset)
if self.fast_decode:
opts["tune"] = "fastdecode"
set_if("threads", encoder_threads)
elif self.vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
if self.crf is not None:
opts["q:v"] = max(1, min(100, 100 - self.crf * 2))
elif self.vcodec in ("h264_nvenc", "hevc_nvenc"):
opts["rc"] = "constqp"
set_if("qp", self.crf)
set_if("preset", self.preset)
elif self.vcodec == "h264_vaapi":
set_if("qp", self.crf)
elif self.vcodec == "h264_qsv":
set_if("global_quality", self.crf)
set_if("preset", self.preset)
elif self.vcodec == "ffv1":
# Lossless intra-frame codec. ``crf``/``preset``/``fast_decode``
# are not meaningful.
set_if("threads", encoder_threads)
else:
set_if("crf", self.crf)
set_if("preset", self.preset)
# Extra options are merged last but never override structured fields (values are kept as given).
for k, v in self.extra_options.items():
if k not in opts:
set_if(k, v)
return opts
def detect_available_hw_encoders() -> list[str]:
"""Probe PyAV/FFmpeg for available hardware video encoders."""
available = []
for codec_name in HW_ENCODERS:
try:
av.codec.Codec(codec_name, "w")
available.append(codec_name)
except Exception: # nosec B110
logger.debug("HW encoder '%s' not available", codec_name) # nosec B110
return available
@dataclass
class DepthEncoderConfig(VideoEncoderConfig):
"""Encoder configuration for depth-map streams.
Inherits the full :class:`VideoEncoderConfig` surface (codec, GOP, CRF,
preset, ``extra_options``…) and adds the four parameters of the depth
quantization pipeline (:func:`quantize_depth`). Inheritance — rather
than composition — keeps the CLI flat: ``--dataset.depth_encoder_config.<field>``
works identically to its RGB counterpart.
Defaults flip ``vcodec`` to ``"hevc"`` (Main 12 profile) and ``pix_fmt``
to ``"yuv420p12le"``, the most widely available 12-bit pixel format.
For archive-grade lossless storage use ``vcodec="ffv1"`` together with
``pix_fmt="gray12le"`` (and clear ``crf``/``preset`` to ``None`` since
``ffv1`` doesn't expose those tuning knobs).
The :attr:`is_depth_map` marker is class-fixed to ``True`` (``init=False``,
so it's hidden from CLI and constructor args) and is what the reader
side keys on to tell depth datasets from RGB ones.
Attributes:
depth_min: Minimum depth in physical units (e.g. metres) represented
by quantum ``0``.
depth_max: Maximum depth represented by quantum :data:`DEPTH_QMAX`.
shift: Pre-log offset for numerical stability near zero.
use_log: ``True`` for logarithmic quantization (default; matches
sensor error profile), ``False`` for linear.
"""
vcodec: str = "hevc"
pix_fmt: str = "yuv420p12le"
depth_min: float = DEFAULT_DEPTH_MIN
depth_max: float = DEFAULT_DEPTH_MAX
shift: float = DEFAULT_DEPTH_SHIFT
use_log: bool = DEFAULT_DEPTH_USE_LOG
# Class invariant — kept out of ``__init__`` (and CLI) but persisted
# via ``asdict`` into ``info.json`` for the reader to detect depth.
is_depth_map: bool = field(default=True, init=False)
def quantize(self, depth: torch.Tensor | np.ndarray) -> torch.Tensor:
"""Apply :func:`quantize_depth` bound to this config's parameters."""
return quantize_depth(depth, self.depth_min, self.depth_max, self.shift, self.use_log)
def dequantize(self, quantized: torch.Tensor | np.ndarray) -> torch.Tensor:
"""Apply :func:`dequantize_depth` bound to this config's parameters."""
return dequantize_depth(quantized, self.depth_min, self.depth_max, self.shift, self.use_log)
def resolve_vcodec(vcodec: str) -> str:
"""Validate vcodec and resolve 'auto' to best available HW encoder, fallback to libsvtav1."""
if vcodec not in VALID_VIDEO_CODECS:
raise ValueError(f"Invalid vcodec '{vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
if vcodec != "auto":
logger.info(f"Using video codec: {vcodec}")
return vcodec
available = detect_available_hw_encoders()
for encoder in HW_ENCODERS:
if encoder in available:
logger.info(f"Auto-selected video codec: {encoder}")
return encoder
logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
return "libsvtav1"
def depth_encoder_defaults() -> DepthEncoderConfig:
"""Return a :class:`DepthEncoderConfig` with depth-camera defaults."""
return DepthEncoderConfig()
def camera_encoder_defaults() -> VideoEncoderConfig:
"""Return a :class:`VideoEncoderConfig` with RGB-camera defaults."""
return VideoEncoderConfig()
def decode_video_frames(
@@ -142,7 +315,7 @@ def decode_video_frames(
Currently supports torchcodec on cpu and pyav.
"""
if backend is None:
backend = get_safe_default_codec()
backend = get_safe_default_video_backend()
if backend == "torchcodec":
return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s, return_uint8=return_uint8)
elif backend in ["pyav", "video_reader"]:
@@ -396,22 +569,136 @@ def decode_video_frames_torchcodec(
return closest_frames
def decode_depth_frames(
video_path: Path | str,
timestamps: list[float],
tolerance_s: float,
*,
depth_min: float = DEFAULT_DEPTH_MIN,
depth_max: float = DEFAULT_DEPTH_MAX,
shift: float = DEFAULT_DEPTH_SHIFT,
use_log: bool = DEFAULT_DEPTH_USE_LOG,
return_quantized: bool = False,
log_loaded_timestamps: bool = False,
) -> torch.Tensor:
"""Decode depth-map frames at the requested timestamps using PyAV.
Mirrors the timestamp-tolerance / closest-frame contract of
:func:`decode_video_frames` but operates entirely through PyAV (the
``torchvision`` and ``torchcodec`` backends don't currently round-trip
12-bit pixel formats reliably).
Each decoded frame is reformatted to ``gray12le`` so the same path
handles ``yuv420p12le`` (HEVC default) and ``gray12le`` (ffv1 archive)
sources transparently.
Args:
video_path: Path to a depth video produced with a
:class:`DepthEncoderConfig`.
timestamps: Frame timestamps to retrieve, in seconds.
tolerance_s: Maximum allowed deviation between the queried and the
actually-decoded timestamps.
depth_min, depth_max, shift, use_log: Parameters used at quantization
time. Should match :func:`info_to_depth_kwargs` extracted from
``info.json`` for the source dataset.
return_quantized: If ``True``, skip the dequantization step and
return raw 12-bit ``uint16`` quanta.
log_loaded_timestamps: Debug logging.
Returns:
``torch.Tensor`` of shape ``(N, H, W)``:
* ``dtype=torch.float32`` (metric depth, default)
* ``dtype=torch.uint16`` when ``return_quantized=True``.
Raises:
FrameTimestampError: If a query timestamp can't be matched within
*tolerance_s*, or if no frames are decoded.
"""
video_path_str = str(video_path)
first_ts = min(timestamps)
last_ts = max(timestamps)
loaded_frames: list[np.ndarray] = []
loaded_ts: list[float] = []
av.logging.set_level(av.logging.WARNING)
with av.open(video_path_str, "r") as container:
try:
stream = container.streams.video[0]
except IndexError as e:
raise FrameTimestampError(f"No video stream in {video_path_str}") from e
# Seek to the keyframe at-or-before first_ts (PyAV doesn't do
# accurate seek, so we still iterate forward to the requested range).
seek_pts = int(first_ts / stream.time_base)
container.seek(seek_pts, stream=stream, any_frame=False, backward=True)
for frame in container.decode(stream):
if frame.pts is None:
continue
current_ts = float(frame.pts * stream.time_base)
if log_loaded_timestamps:
logger.info(f"depth frame loaded at timestamp={current_ts:.4f}")
loaded_frames.append(
decode_depth_frame(
frame,
depth_min=depth_min,
depth_max=depth_max,
shift=shift,
use_log=use_log,
return_quantized=True,
)
)
loaded_ts.append(current_ts)
if current_ts >= last_ts:
break
av.logging.restore_default_callback()
if not loaded_frames:
raise FrameTimestampError(
f"No depth frames decoded from {video_path_str} for timestamps {timestamps}"
)
query_ts = torch.tensor(timestamps)
loaded_ts_t = torch.tensor(loaded_ts)
dist = torch.cdist(query_ts[:, None], loaded_ts_t[:, None], p=1)
min_, argmin_ = dist.min(1)
is_within_tol = min_ < tolerance_s
if not is_within_tol.all():
raise FrameTimestampError(
f"One or several query timestamps violate the tolerance "
f"({min_[~is_within_tol]} > {tolerance_s=})."
f"\nqueried timestamps: {query_ts}"
f"\nloaded timestamps: {loaded_ts_t}"
f"\nvideo: {video_path_str}"
)
closest = np.stack([loaded_frames[i] for i in argmin_]) # (N, H, W) uint16
quantized = torch.from_numpy(closest)
if return_quantized:
return quantized
return dequantize_depth(quantized, depth_min, depth_max, shift, use_log)
def encode_video_frames(
imgs_dir: Path | str,
video_path: Path | str,
fps: int,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int | None = 2,
crf: int | None = 30,
fast_decode: int = 0,
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
*,
log_level: int | None = av.logging.WARNING,
overwrite: bool = False,
preset: int | None = None,
encoder_threads: int | None = None,
) -> None:
"""More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
vcodec = resolve_vcodec(vcodec)
if camera_encoder_config is None:
camera_encoder_config = VideoEncoderConfig()
vcodec = camera_encoder_config.vcodec
pix_fmt = camera_encoder_config.pix_fmt
video_path = Path(video_path)
imgs_dir = Path(imgs_dir)
@@ -422,42 +709,18 @@ def encode_video_frames(
video_path.parent.mkdir(parents=True, exist_ok=True)
# Encoders/pixel formats incompatibility check
if (vcodec == "libsvtav1" or vcodec == "hevc") and pix_fmt == "yuv444p":
logger.warning(
f"Incompatible pixel format 'yuv444p' for codec {vcodec}, auto-selecting format 'yuv420p'"
)
pix_fmt = "yuv420p"
# Get input frames
template = "frame-" + ("[0-9]" * 6) + ".png"
input_list = sorted(
glob.glob(str(imgs_dir / template)), key=lambda x: int(x.split("-")[-1].split(".")[0])
)
# Define video output frame size (assuming all input frames are the same size)
if len(input_list) == 0:
raise FileNotFoundError(f"No images found in {imgs_dir}.")
with Image.open(input_list[0]) as dummy_image:
width, height = dummy_image.size
# Define video codec options
video_options = _get_codec_options(vcodec, g, crf, preset)
if fast_decode:
key = "svtav1-params" if vcodec == "libsvtav1" else "tune"
value = f"fast-decode={fast_decode}" if vcodec == "libsvtav1" else "fastdecode"
video_options[key] = value
if encoder_threads is not None:
if vcodec == "libsvtav1":
lp_param = f"lp={encoder_threads}"
if "svtav1-params" in video_options:
video_options["svtav1-params"] += f":{lp_param}"
else:
video_options["svtav1-params"] = lp_param
else:
video_options["threads"] = str(encoder_threads)
video_options = camera_encoder_config.get_codec_options(encoder_threads, as_strings=True)
# Set logging level
if log_level is not None:
@@ -494,7 +757,10 @@ def encode_video_frames(
def concatenate_video_files(
input_video_paths: list[Path | str], output_video_path: Path, overwrite: bool = True
input_video_paths: list[Path | str],
output_video_path: Path,
overwrite: bool = True,
compatibility_check: bool = False,
):
"""
Concatenate multiple video files into a single video file using pyav.
@@ -507,6 +773,7 @@ def concatenate_video_files(
input_video_paths: Ordered list of input video file paths to concatenate.
output_video_path: Path to the output video file.
overwrite: Whether to overwrite the output video file if it already exists. Default is True.
compatibility_check: Whether to check if the input videos are compatible. Default is False.
Note:
- Creates a temporary directory for intermediate files that is cleaned up after use.
@@ -525,6 +792,22 @@ def concatenate_video_files(
if len(input_video_paths) == 0:
raise FileNotFoundError("No input video paths provided.")
# This check may be skipped at recording time as videos are encoded with the same encoder config.
if compatibility_check:
reference_video_info = get_video_info(input_video_paths[0])
for input_path in input_video_paths[1:]:
video_info = get_video_info(input_path)
if (
video_info["video.height"] != reference_video_info["video.height"]
or video_info["video.width"] != reference_video_info["video.width"]
or video_info["video.fps"] != reference_video_info["video.fps"]
or video_info["video.codec"] != reference_video_info["video.codec"]
or video_info["video.pix_fmt"] != reference_video_info["video.pix_fmt"]
):
raise ValueError(
f"Input video {input_path} is not compatible with the reference video {input_video_paths[0]}."
)
# Create a temporary .ffconcat file to list the input video paths
with tempfile.NamedTemporaryFile(mode="w", suffix=".ffconcat", delete=False) as tmp_concatenate_file:
tmp_concatenate_file.write("ffconcat version 1.0\n")
@@ -591,33 +874,31 @@ class _CameraEncoderThread(threading.Thread):
fps: int,
vcodec: str,
pix_fmt: str,
g: int | None,
crf: int | None,
preset: int | None,
codec_options: dict[str, str],
frame_queue: queue.Queue,
result_queue: queue.Queue,
stop_event: threading.Event,
encoder_threads: int | None = None,
depth_encoder_config: "DepthEncoderConfig | None" = None,
):
super().__init__(daemon=True)
self.video_path = video_path
self.fps = fps
self.vcodec = vcodec
self.pix_fmt = pix_fmt
self.g = g
self.crf = crf
self.preset = preset
self.codec_options = codec_options
self.frame_queue = frame_queue
self.result_queue = result_queue
self.stop_event = stop_event
self.encoder_threads = encoder_threads
self.depth_encoder_config = depth_encoder_config
def run(self) -> None:
from .compute_stats import RunningQuantileStats, auto_downsample_height_width
container = None
output_stream = None
stats_tracker = RunningQuantileStats()
is_depth = self.depth_encoder_config is not None
stats_tracker = RunningQuantileStats() if not is_depth else None
frame_count = 0
try:
@@ -635,51 +916,45 @@ class _CameraEncoderThread(threading.Thread):
# Sentinel: flush and close
break
# Ensure HWC uint8 numpy array
# Ensure HWC (RGB or depth) uint8 (RGB only) numpy array
if isinstance(frame_data, np.ndarray):
if frame_data.ndim == 3 and frame_data.shape[0] == 3:
# CHW -> HWC
frame_data = frame_data.transpose(1, 2, 0)
if frame_data.dtype != np.uint8:
if frame_data.dtype != np.uint8 and not is_depth:
frame_data = (frame_data * 255).astype(np.uint8)
# Open container on first frame (to get width/height)
if container is None:
height, width = frame_data.shape[:2]
video_options = _get_codec_options(self.vcodec, self.g, self.crf, self.preset)
if self.encoder_threads is not None:
if self.vcodec == "libsvtav1":
lp_param = f"lp={self.encoder_threads}"
if "svtav1-params" in video_options:
video_options["svtav1-params"] += f":{lp_param}"
else:
video_options["svtav1-params"] = lp_param
else:
video_options["threads"] = str(self.encoder_threads)
Path(self.video_path).parent.mkdir(parents=True, exist_ok=True)
container = av.open(str(self.video_path), "w")
output_stream = container.add_stream(self.vcodec, self.fps, options=video_options)
output_stream = container.add_stream(self.vcodec, self.fps, options=self.codec_options)
output_stream.pix_fmt = self.pix_fmt
output_stream.width = width
output_stream.height = height
output_stream.time_base = Fraction(1, self.fps)
# Encode frame with explicit timestamps
pil_img = Image.fromarray(frame_data)
video_frame = av.VideoFrame.from_image(pil_img)
if is_depth:
video_frame = encode_depth_frame_pyav(frame_data, pix_fmt=self.pix_fmt, depth_min=self.depth_encoder_config.depth_min, depth_max=self.depth_encoder_config.depth_max, shift=self.depth_encoder_config.shift, use_log=self.depth_encoder_config.use_log)
else:
pil_img = Image.fromarray(frame_data)
video_frame = av.VideoFrame.from_image(pil_img)
video_frame.pts = frame_count
video_frame.time_base = Fraction(1, self.fps)
packet = output_stream.encode(video_frame)
if packet:
container.mux(packet)
# Update stats with downsampled frame (per-channel stats like compute_episode_stats)
img_chw = frame_data.transpose(2, 0, 1) # HWC -> CHW
img_downsampled = auto_downsample_height_width(img_chw)
# Reshape CHW to (H*W, C) for per-channel stats
channels = img_downsampled.shape[0]
img_for_stats = img_downsampled.transpose(1, 2, 0).reshape(-1, channels)
stats_tracker.update(img_for_stats)
if not is_depth:
# Update stats with downsampled frame (per-channel stats like compute_episode_stats)
img_chw = frame_data.transpose(2, 0, 1) # HWC -> CHW
img_downsampled = auto_downsample_height_width(img_chw)
# Reshape CHW to (H*W, C) for per-channel stats
channels = img_downsampled.shape[0]
img_for_stats = img_downsampled.transpose(1, 2, 0).reshape(-1, channels)
stats_tracker.update(img_for_stats)
frame_count += 1
@@ -694,8 +969,10 @@ class _CameraEncoderThread(threading.Thread):
av.logging.restore_default_callback()
# Get stats and put on result queue
if frame_count >= 2:
# Get stats and put on result queue (depth streams skip stats)
if is_depth:
self.result_queue.put(("ok", None))
elif frame_count >= 2:
stats = stats_tracker.get_statistics()
self.result_queue.put(("ok", stats))
else:
@@ -724,22 +1001,40 @@ class StreamingVideoEncoder:
def __init__(
self,
fps: int,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int | None = 2,
crf: int | None = 30,
preset: int | None = None,
queue_maxsize: int = 30,
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
*,
queue_maxsize: int = 30,
depth_encoder_config: "DepthEncoderConfig | None" = None,
depth_keys: list[str] | None = None,
):
"""
Args:
fps: Frames per second for the output videos.
camera_encoder_config: Video encoder settings applied to all cameras.
When ``None``, :class:`VideoEncoderConfig` defaults are used.
encoder_threads: Number of encoder threads (global setting).
``None`` lets the codec decide.
queue_maxsize: Max frames to buffer per camera before
back-pressure drops frames.
depth_encoder_config: Optional depth encoder configuration applied
to all depth video keys listed in ``depth_keys``.
depth_keys: Video keys (matching the dataset feature names) that
must be encoded as quantized depth maps using
``depth_encoder_config``. Required when ``depth_encoder_config``
is provided.
"""
self.fps = fps
self.vcodec = resolve_vcodec(vcodec)
self.pix_fmt = pix_fmt
self.g = g
self.crf = crf
self.preset = preset
self._camera_encoder_config = camera_encoder_config or VideoEncoderConfig()
self._encoder_threads = encoder_threads
self.queue_maxsize = queue_maxsize
self.encoder_threads = encoder_threads
self._depth_encoder_config = depth_encoder_config
self._depth_keys: set[str] = set(depth_keys or [])
if self._depth_keys and self._depth_encoder_config is None:
raise ValueError(
"StreamingVideoEncoder received depth_keys without a depth_encoder_config; "
"either pass a DepthEncoderConfig or remove depth_keys."
)
self._frame_queues: dict[str, queue.Queue] = {}
self._result_queues: dict[str, queue.Queue] = {}
@@ -770,18 +1065,28 @@ class StreamingVideoEncoder:
temp_video_dir = Path(tempfile.mkdtemp(dir=temp_dir))
video_path = temp_video_dir / f"{video_key.replace('/', '_')}_streaming.mp4"
is_depth_key = video_key in self._depth_keys
encoder_cfg: VideoEncoderConfig
depth_cfg = None
if is_depth_key:
assert self._depth_encoder_config is not None # guaranteed by __init__
encoder_cfg = self._depth_encoder_config
depth_cfg = self._depth_encoder_config
else:
encoder_cfg = self._camera_encoder_config
vcodec = encoder_cfg.vcodec
codec_options = encoder_cfg.get_codec_options(self._encoder_threads)
encoder_thread = _CameraEncoderThread(
video_path=video_path,
fps=self.fps,
vcodec=self.vcodec,
pix_fmt=self.pix_fmt,
g=self.g,
crf=self.crf,
preset=self.preset,
vcodec=vcodec,
pix_fmt=encoder_cfg.pix_fmt,
codec_options=codec_options,
frame_queue=frame_queue,
result_queue=result_queue,
stop_event=stop_event,
encoder_threads=self.encoder_threads,
depth_encoder_config=depth_cfg,
)
encoder_thread.start()
@@ -986,8 +1291,18 @@ def get_audio_info(video_path: Path | str) -> dict:
return audio_info
def get_video_info(video_path: Path | str) -> dict:
# Set logging level
def get_video_info(
video_path: Path | str,
video_encoder_config: "VideoEncoderConfig | None" = None,
) -> dict:
"""Build the ``video.*`` / ``audio.*`` info dict persisted in ``info.json``.
Args:
video_path: Path to the encoded video file to probe.
video_encoder_config: If provided, record the exact encoder settings used to encode this
video. Stream-derived values take precedence — encoder fields are only written for keys
not already populated from the video file itself.
"""
logging.getLogger("libav").setLevel(av.logging.WARNING)
# Getting video stream information
@@ -1004,7 +1319,6 @@ def get_video_info(video_path: Path | str) -> dict:
video_info["video.width"] = video_stream.width
video_info["video.codec"] = video_stream.codec.canonical_name
video_info["video.pix_fmt"] = video_stream.pix_fmt
video_info["video.is_depth_map"] = False
# Calculate fps from r_frame_rate
video_info["video.fps"] = int(video_stream.base_rate)
@@ -1018,9 +1332,67 @@ def get_video_info(video_path: Path | str) -> dict:
# Adding audio stream information
video_info.update(**get_audio_info(video_path))
# Add additional encoder configuration if provided (no override of stream-derived values)
# Depth related fields flow naturally through this path.
if video_encoder_config is not None:
for field_name, field_value in asdict(video_encoder_config).items():
video_info.setdefault(f"video.{field_name}", field_value)
# Fallback case where no encoder config is provided or the video is not a depth map.
video_info.setdefault("video.is_depth_map", False)
return video_info
# ─── Depth metadata helpers (reader side) ────────────────────────────
_DEPTH_INFO_KEYS: tuple[str, ...] = (
"video.depth_min",
"video.depth_max",
"video.shift",
"video.use_log",
)
def seed_depth_feature_info(
features: dict[str, dict],
depth_encoder_config: "DepthEncoderConfig | None",
) -> None:
"""Pre-populate per-feature ``video.<field>`` entries from *depth_encoder_config*.
``update_video_info`` only runs after the first episode video is encoded,
so without this seeding step ``features[key]["info"]`` carries no
quantization range until then. Consumers that read the dataset feature
spec mid-recording (e.g. the rerun visualizer pinning the depth colormap
to ``video.depth_min`` / ``video.depth_max``) would otherwise see no
range during episode 1 and re-normalize per frame.
Stream-derived values written later by :func:`get_video_info` /
``update_video_info`` win over these seeds (the merge is
``{**existing, **stream_info}``), so callers can safely re-run this on
a partially-populated info dict.
No-op when ``depth_encoder_config`` is ``None`` or no feature is flagged
as a depth map.
"""
if depth_encoder_config is None:
return
encoder_fields = {
f"video.{name}": value for name, value in asdict(depth_encoder_config).items()
}
for ft in features.values():
if ft.get("dtype") != "video":
continue
info = ft.get("info") or {}
if not info.get("video.is_depth_map", False):
continue
# Only fill fields not already set, so explicit user-provided info is preserved.
for k, v in encoder_fields.items():
info.setdefault(k, v)
ft["info"] = info
def get_video_pixel_channels(pix_fmt: str) -> int:
if "gray" in pix_fmt or "depth" in pix_fmt or "monochrome" in pix_fmt:
return 1
-4
View File
@@ -24,8 +24,6 @@ from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
from .pretrained import PreTrainedPolicy as PreTrainedPolicy
from .sac.configuration_sac import SACConfig as SACConfig
from .sac.reward_model.configuration_classifier import RewardClassifierConfig as RewardClassifierConfig
from .sarm.configuration_sarm import SARMConfig as SARMConfig
from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
from .tdmpc.configuration_tdmpc import TDMPCConfig as TDMPCConfig
from .utils import make_robot_action, prepare_observation_for_inference
@@ -46,9 +44,7 @@ __all__ = [
"PI0Config",
"PI0FastConfig",
"PI05Config",
"RewardClassifierConfig",
"SACConfig",
"SARMConfig",
"SmolVLAConfig",
"TDMPCConfig",
"VQBeTConfig",
+4 -3
View File
@@ -142,9 +142,10 @@ class ACTPolicy(PreTrainedPolicy):
actions_hat, (mu_hat, log_sigma_x2_hat) = self.model(batch)
l1_loss = (
F.l1_loss(batch[ACTION], actions_hat, reduction="none") * ~batch["action_is_pad"].unsqueeze(-1)
).mean()
abs_err = F.l1_loss(batch[ACTION], actions_hat, reduction="none")
valid_mask = ~batch["action_is_pad"].unsqueeze(-1)
num_valid = valid_mask.sum() * abs_err.shape[-1]
l1_loss = (abs_err * valid_mask).sum() / num_valid.clamp_min(1)
loss_dict = {"l1_loss": l1_loss.item()}
if self.config.use_vae:
@@ -380,7 +380,9 @@ class DiffusionModel(nn.Module):
f"{self.config.do_mask_loss_for_padding=}."
)
in_episode_bound = ~batch["action_is_pad"]
loss = loss * in_episode_bound.unsqueeze(-1)
mask = in_episode_bound.unsqueeze(-1)
num_valid = mask.sum() * loss.shape[-1]
return (loss * mask).sum() / num_valid.clamp_min(1)
return loss.mean()
+2 -30
View File
@@ -52,8 +52,6 @@ from .pi0.configuration_pi0 import PI0Config
from .pi05.configuration_pi05 import PI05Config
from .pretrained import PreTrainedPolicy
from .sac.configuration_sac import SACConfig
from .sac.reward_model.configuration_classifier import RewardClassifierConfig
from .sarm.configuration_sarm import SARMConfig
from .smolvla.configuration_smolvla import SmolVLAConfig
from .tdmpc.configuration_tdmpc import TDMPCConfig
from .utils import validate_visual_features_consistency
@@ -89,7 +87,7 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
Args:
name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
"multi_task_dit", "vqbet", "pi0", "pi05", "sac", "reward_classifier", "smolvla", "wall_x".
"multi_task_dit", "vqbet", "pi0", "pi05", "sac", "smolvla", "wall_x".
Returns:
The policy class corresponding to the given name.
@@ -132,18 +130,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .sac.modeling_sac import SACPolicy
return SACPolicy
elif name == "reward_classifier":
from .sac.reward_model.modeling_classifier import Classifier
return Classifier
elif name == "smolvla":
from .smolvla.modeling_smolvla import SmolVLAPolicy
return SmolVLAPolicy
elif name == "sarm":
from .sarm.modeling_sarm import SARMRewardModel
return SARMRewardModel
elif name == "groot":
from .groot.modeling_groot import GrootPolicy
@@ -173,7 +163,7 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
Args:
policy_type: The type of the policy. Supported types include "tdmpc",
"multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "sac",
"smolvla", "reward_classifier", "wall_x".
"smolvla", "wall_x".
**kwargs: Keyword arguments to be passed to the configuration class constructor.
Returns:
@@ -200,8 +190,6 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return SACConfig(**kwargs)
elif policy_type == "smolvla":
return SmolVLAConfig(**kwargs)
elif policy_type == "reward_classifier":
return RewardClassifierConfig(**kwargs)
elif policy_type == "groot":
return GrootConfig(**kwargs)
elif policy_type == "xvla":
@@ -378,14 +366,6 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, RewardClassifierConfig):
from .sac.reward_model.processor_classifier import make_classifier_processor
processors = make_classifier_processor(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, SmolVLAConfig):
from .smolvla.processor_smolvla import make_smolvla_pre_post_processors
@@ -394,14 +374,6 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, SARMConfig):
from .sarm.processor_sarm import make_sarm_pre_post_processors
processors = make_sarm_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
dataset_meta=kwargs.get("dataset_meta"),
)
elif isinstance(policy_cfg, GrootConfig):
from .groot.processor_groot import make_groot_pre_post_processors
+5 -9
View File
@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from pathlib import Path
from typing import TYPE_CHECKING
@@ -174,17 +173,14 @@ N_COLOR_CHANNELS = 3
# config
@dataclass
class GR00TN15Config(PretrainedConfig):
model_type = "gr00t_n1_5"
backbone_cfg: dict = field(init=False, metadata={"help": "Backbone configuration."})
action_head_cfg: dict = field(init=False, metadata={"help": "Action head configuration."})
action_horizon: int = field(init=False, metadata={"help": "Action horizon."})
action_dim: int = field(init=False, metadata={"help": "Action dimension."})
compute_dtype: str = field(default="float32", metadata={"help": "Compute dtype."})
backbone_cfg: dict
action_head_cfg: dict
action_horizon: int
action_dim: int
compute_dtype: str = "float32"
def __init__(self, **kwargs):
super().__init__(**kwargs)
@@ -688,8 +688,9 @@ class DiffusionObjective(nn.Module):
loss = F.mse_loss(predicted, target, reduction="none")
if self.do_mask_loss_for_padding and "action_is_pad" in batch:
valid_actions = ~batch["action_is_pad"]
loss = loss * valid_actions.unsqueeze(-1)
mask = ~batch["action_is_pad"].unsqueeze(-1)
num_valid = mask.sum() * loss.shape[-1]
return (loss * mask).sum() / num_valid.clamp_min(1)
return loss.mean()
@@ -752,8 +753,9 @@ class FlowMatchingObjective(nn.Module):
loss = F.mse_loss(predicted_velocity, target_velocity, reduction="none")
if self.do_mask_loss_for_padding and "action_is_pad" in batch:
valid_mask = ~batch["action_is_pad"]
loss = loss * valid_mask.unsqueeze(-1)
mask = ~batch["action_is_pad"].unsqueeze(-1)
num_valid = mask.sum() * loss.shape[-1]
return (loss * mask).sum() / num_valid.clamp_min(1)
return loss.mean()
@@ -227,6 +227,7 @@ class PI0FastPaliGemma(nn.Module):
# forward(..., adarms_cond=...) is supported (same as pi0/pi05).
if use_adarms[0]:
text_config = self.paligemma.config.text_config
del self.paligemma.model.language_model
self.paligemma.model.language_model = PiGemmaModel(text_config)
self.to_bfloat16_for_selected_params(precision)
+6
View File
@@ -197,6 +197,9 @@ class PiGemmaModel(GemmaModel): # type: ignore[misc]
def __init__(self, config: GemmaConfig, **kwargs):
super().__init__(config, **kwargs)
# Free parent-allocated layers/norm before replacing to avoid ~2x peak memory.
del self.layers
del self.norm
# if not getattr(config, "use_adarms", False):
# return
cond_dim = getattr(config, "adarms_cond_dim", None)
@@ -328,6 +331,7 @@ class PiGemmaForCausalLM(GemmaForCausalLM): # type: ignore[misc]
def __init__(self, config: GemmaConfig, **kwargs):
super().__init__(config, **kwargs)
del self.model
self.model = PiGemmaModel(config)
@@ -336,6 +340,7 @@ class PaliGemmaModelWithPiGemma(PaliGemmaModel):
def __init__(self, config):
super().__init__(config)
del self.language_model
self.language_model = PiGemmaModel(config.text_config)
@@ -344,6 +349,7 @@ class PaliGemmaForConditionalGenerationWithPiGemma(PaliGemmaForConditionalGenera
def __init__(self, config):
super().__init__(config)
del self.model
self.model = PaliGemmaModelWithPiGemma(config)
# Make modules available through conditional class for BC
+2
View File
@@ -19,6 +19,7 @@ from .action_queue import ActionQueue
from .configuration_rtc import RTCConfig
from .latency_tracker import LatencyTracker
from .modeling_rtc import RTCProcessor
from .relative import reanchor_relative_rtc_prefix
__all__ = [
"ActionInterpolator",
@@ -26,4 +27,5 @@ __all__ = [
"LatencyTracker",
"RTCConfig",
"RTCProcessor",
"reanchor_relative_rtc_prefix",
]
+58
View File
@@ -0,0 +1,58 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Relative-action helpers for Real-Time Chunking (RTC)."""
from __future__ import annotations
import torch
from lerobot.processor import (
NormalizerProcessorStep,
RelativeActionsProcessorStep,
TransitionKey,
create_transition,
to_relative_actions,
)
def reanchor_relative_rtc_prefix(
prev_actions_absolute: torch.Tensor,
current_state: torch.Tensor,
relative_step: RelativeActionsProcessorStep,
normalizer_step: NormalizerProcessorStep | None,
policy_device: torch.device | str,
) -> torch.Tensor:
"""Convert absolute leftover actions into model-space for relative-action RTC policies.
When using relative actions, the RTC prefix (previous chunk's unexecuted tail)
is stored in absolute coordinates. Before feeding it back to the policy, this
helper re-expresses those actions relative to the robot's current joint state
and optionally normalizes them so the policy receives correctly scaled inputs.
"""
state = current_state.detach().cpu()
if state.dim() == 1:
state = state.unsqueeze(0)
action_cpu = prev_actions_absolute.detach().cpu()
mask = relative_step._build_mask(action_cpu.shape[-1])
relative_actions = to_relative_actions(action_cpu, state, mask)
transition = create_transition(action=relative_actions)
if normalizer_step is not None:
transition = normalizer_step(transition)
return transition[TransitionKey.ACTION].to(policy_device)
-1
View File
@@ -1 +0,0 @@
../../../../docs/source/policy_sarm_README.md
@@ -394,13 +394,21 @@ class SmolVLAPolicy(PreTrainedPolicy):
loss_dict["losses_after_rm_padding"] = losses.clone().mean().item()
if reduction == "none":
# Return per-sample losses (B,) by averaging over time and action dims
per_sample_loss = losses.mean(dim=(1, 2))
# Return per-sample losses (B,) by averaging over valid (time, action) entries
if actions_is_pad is None:
per_sample_loss = losses.mean(dim=(1, 2))
else:
num_valid = ((~actions_is_pad).sum(dim=1) * losses.shape[-1]).clamp_min(1)
per_sample_loss = losses.sum(dim=(1, 2)) / num_valid
loss_dict["loss"] = per_sample_loss.mean().item()
return per_sample_loss, loss_dict
else:
# Default: return scalar mean loss
loss = losses.mean()
# Default: return scalar mean loss over valid (time, action) entries
if actions_is_pad is None:
loss = losses.mean()
else:
num_valid = ((~actions_is_pad).sum() * losses.shape[-1]).clamp_min(1)
loss = losses.sum() / num_valid
loss_dict["loss"] = loss.item()
return loss, loss_dict
+1 -1
View File
@@ -557,7 +557,7 @@ class RewardClassifierProcessorStep(ProcessorStep):
def __post_init__(self):
"""Initializes the reward classifier model after the dataclass is created."""
if self.pretrained_path is not None:
from lerobot.policies.sac.reward_model.modeling_classifier import Classifier
from lerobot.rewards.classifier.modeling_classifier import Classifier
self.reward_classifier = Classifier.from_pretrained(self.pretrained_path)
self.reward_classifier.to(self.device)
@@ -142,6 +142,10 @@ class RelativeActionsProcessorStep(ProcessorStep):
new_transition[TransitionKey.ACTION] = to_relative_actions(action, state, mask)
return new_transition
def get_cached_state(self) -> torch.Tensor | None:
"""Return the cached ``observation.state`` used as the reference point for relative/absolute action conversions."""
return self._last_state
def get_config(self) -> dict[str, Any]:
return {
"enabled": self.enabled,
@@ -182,7 +186,8 @@ class AbsoluteActionsProcessorStep(ProcessorStep):
"but relative_step is None. Ensure relative_step is set when constructing the postprocessor."
)
if self.relative_step._last_state is None:
cached_state = self.relative_step.get_cached_state()
if cached_state is None:
raise RuntimeError(
"AbsoluteActionsProcessorStep requires state from RelativeActionsProcessorStep "
"but no state has been cached. Ensure the preprocessor runs before the postprocessor."
@@ -194,9 +199,7 @@ class AbsoluteActionsProcessorStep(ProcessorStep):
return new_transition
mask = self.relative_step._build_mask(action.shape[-1])
new_transition[TransitionKey.ACTION] = to_absolute_actions(
action, self.relative_step._last_state, mask
)
new_transition[TransitionKey.ACTION] = to_absolute_actions(action, cached_state, mask)
return new_transition
def get_config(self) -> dict[str, Any]:
+36
View File
@@ -0,0 +1,36 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .classifier.configuration_classifier import RewardClassifierConfig as RewardClassifierConfig
from .factory import (
get_reward_model_class as get_reward_model_class,
make_reward_model as make_reward_model,
make_reward_model_config as make_reward_model_config,
make_reward_pre_post_processors as make_reward_pre_post_processors,
)
from .pretrained import PreTrainedRewardModel as PreTrainedRewardModel
from .sarm.configuration_sarm import SARMConfig as SARMConfig
__all__ = [
# Configuration classes
"RewardClassifierConfig",
"SARMConfig",
# Base class
"PreTrainedRewardModel",
# Factory functions
"get_reward_model_class",
"make_reward_model",
"make_reward_model_config",
"make_reward_pre_post_processors",
]
@@ -1,5 +1,3 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,14 +13,15 @@
# limitations under the License.
from dataclasses import dataclass, field
from lerobot.configs import NormalizationMode, PreTrainedConfig
from lerobot.configs import NormalizationMode
from lerobot.configs.rewards import RewardModelConfig
from lerobot.optim import AdamWConfig, LRSchedulerConfig, OptimizerConfig
from lerobot.utils.constants import OBS_IMAGE
@PreTrainedConfig.register_subclass(name="reward_classifier")
@RewardModelConfig.register_subclass(name="reward_classifier")
@dataclass
class RewardClassifierConfig(PreTrainedConfig):
class RewardClassifierConfig(RewardModelConfig):
"""Configuration for the Reward Classifier model."""
name: str = "reward_classifier"
@@ -1,5 +1,3 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -19,11 +17,10 @@ import logging
import torch
from torch import Tensor, nn
from lerobot.rewards.classifier.configuration_classifier import RewardClassifierConfig
from lerobot.rewards.pretrained import PreTrainedRewardModel
from lerobot.utils.constants import OBS_IMAGE, REWARD
from ...pretrained import PreTrainedPolicy
from .configuration_classifier import RewardClassifierConfig
class ClassifierOutput:
"""Wrapper for classifier outputs with additional metadata."""
@@ -99,7 +96,7 @@ class SpatialLearnedEmbeddings(nn.Module):
return output
class Classifier(PreTrainedPolicy):
class Classifier(PreTrainedRewardModel):
"""Image classifier built on top of a pre-trained encoder."""
name = "reward_classifier"
@@ -235,6 +232,16 @@ class Classifier(PreTrainedPolicy):
return ClassifierOutput(logits=logits, probabilities=probabilities, hidden_states=encoder_outputs)
def compute_reward(self, batch: dict[str, Tensor]) -> Tensor:
"""Returns 1.0 for success, 0.0 for failure based on image observations."""
images = [batch[key] for key in self.config.input_features if key.startswith(OBS_IMAGE)]
output = self.predict(images)
if self.config.num_classes == 2:
return (output.probabilities > 0.5).float()
else:
return torch.argmax(output.probabilities, dim=1).float()
def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict[str, Tensor]]:
"""Standard forward pass for training compatible with train.py."""
# Extract images and labels
@@ -269,10 +276,6 @@ class Classifier(PreTrainedPolicy):
def predict_reward(self, batch, threshold=0.5):
"""Eval method. Returns predicted reward with the decision threshold as argument."""
# Check for both OBS_IMAGE and OBS_IMAGES prefixes
batch = self.normalize_inputs(batch)
batch = self.normalize_targets(batch)
# Extract images from batch dict
images = [batch[key] for key in self.config.input_features if key.startswith(OBS_IMAGE)]
@@ -282,28 +285,3 @@ class Classifier(PreTrainedPolicy):
return (probs > threshold).float()
else:
return torch.argmax(self.predict(images).probabilities, dim=1)
def get_optim_params(self):
"""Return optimizer parameters for the policy."""
return self.parameters()
def select_action(self, batch: dict[str, Tensor]) -> Tensor:
"""
This method is required by PreTrainedPolicy but not used for reward classifiers.
The reward classifier is not an actor and does not select actions.
"""
raise NotImplementedError("Reward classifiers do not select actions")
def predict_action_chunk(self, batch: dict[str, Tensor]) -> Tensor:
"""
This method is required by PreTrainedPolicy but not used for reward classifiers.
The reward classifier is not an actor and does not produce action chunks.
"""
raise NotImplementedError("Reward classifiers do not predict action chunks")
def reset(self):
"""
This method is required by PreTrainedPolicy but not used for reward classifiers.
The reward classifier is not an actor and does not select actions.
"""
pass
@@ -1,5 +1,3 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -27,8 +25,7 @@ from lerobot.processor import (
policy_action_to_transition,
transition_to_policy_action,
)
from .configuration_classifier import RewardClassifierConfig
from lerobot.rewards.classifier.configuration_classifier import RewardClassifierConfig
def make_classifier_processor(
@@ -52,8 +49,6 @@ def make_classifier_processor(
Args:
config: The configuration object for the RewardClassifier.
dataset_stats: A dictionary of statistics for normalization.
preprocessor_kwargs: Additional arguments for the pre-processor pipeline.
postprocessor_kwargs: Additional arguments for the post-processor pipeline.
Returns:
A tuple containing the configured pre-processor and post-processor pipelines.
+238
View File
@@ -0,0 +1,238 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import logging
from typing import Any
import torch
from lerobot.configs.rewards import RewardModelConfig
from lerobot.processor import PolicyAction, PolicyProcessorPipeline
from lerobot.rewards.classifier.configuration_classifier import RewardClassifierConfig
from lerobot.rewards.pretrained import PreTrainedRewardModel
from lerobot.rewards.sarm.configuration_sarm import SARMConfig
def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
"""
Retrieves a reward model class by its registered name.
This function uses dynamic imports to avoid loading all reward model classes into
memory at once, improving startup time and reducing dependencies.
Args:
name: The name of the reward model. Supported names are "reward_classifier",
"sarm".
Returns:
The reward model class corresponding to the given name.
Raises:
ValueError: If the reward model name is not recognized.
"""
if name == "reward_classifier":
from lerobot.rewards.classifier.modeling_classifier import Classifier
return Classifier
elif name == "sarm":
from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel
return SARMRewardModel
else:
try:
return _get_reward_model_cls_from_name(name=name)
except Exception as e:
raise ValueError(f"Reward model type '{name}' is not available.") from e
def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
"""
Instantiates a reward model configuration object based on the reward type.
This factory function simplifies the creation of reward model configuration objects
by mapping a string identifier to the corresponding config class.
Args:
reward_type: The type of the reward model. Supported types include
"reward_classifier", "sarm".
**kwargs: Keyword arguments to be passed to the configuration class constructor.
Returns:
An instance of a `RewardModelConfig` subclass.
Raises:
ValueError: If the `reward_type` is not recognized.
"""
if reward_type == "reward_classifier":
return RewardClassifierConfig(**kwargs)
elif reward_type == "sarm":
return SARMConfig(**kwargs)
else:
try:
config_cls = RewardModelConfig.get_choice_class(reward_type)
return config_cls(**kwargs)
except Exception as e:
raise ValueError(f"Reward model type '{reward_type}' is not available.") from e
def make_reward_model(cfg: RewardModelConfig, **kwargs) -> PreTrainedRewardModel:
"""
Instantiate a reward model from its configuration.
Args:
cfg: The configuration for the reward model to be created. If
`cfg.pretrained_path` is set, the model will be loaded with weights
from that path.
**kwargs: Additional keyword arguments forwarded to the model constructor
(e.g., ``dataset_stats``, ``dataset_meta``).
Returns:
An instantiated and device-placed reward model.
"""
reward_cls = get_reward_model_class(cfg.type)
kwargs["config"] = cfg
if cfg.pretrained_path:
kwargs["pretrained_name_or_path"] = cfg.pretrained_path
reward_model = reward_cls.from_pretrained(**kwargs)
else:
reward_model = reward_cls(**kwargs)
reward_model.to(cfg.device)
assert isinstance(reward_model, torch.nn.Module)
return reward_model
def make_reward_pre_post_processors(
reward_cfg: RewardModelConfig,
**kwargs,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""
Create pre- and post-processor pipelines for a given reward model.
Each reward model type has a dedicated factory function for its processors.
Args:
reward_cfg: The configuration of the reward model for which to create processors.
**kwargs: Additional keyword arguments passed to the processor factory
(e.g., ``dataset_stats``, ``dataset_meta``).
Returns:
A tuple containing the input (pre-processor) and output (post-processor) pipelines.
Raises:
ValueError: If a processor factory is not implemented for the given reward
model configuration type.
"""
# Create a new processor based on reward model type
if isinstance(reward_cfg, RewardClassifierConfig):
from lerobot.rewards.classifier.processor_classifier import make_classifier_processor
return make_classifier_processor(
config=reward_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(reward_cfg, SARMConfig):
from lerobot.rewards.sarm.processor_sarm import make_sarm_pre_post_processors
return make_sarm_pre_post_processors(
config=reward_cfg,
dataset_stats=kwargs.get("dataset_stats"),
dataset_meta=kwargs.get("dataset_meta"),
)
else:
try:
processors = _make_processors_from_reward_model_config(
config=reward_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
except Exception as e:
raise ValueError(
f"Processor for reward model type '{reward_cfg.type}' is not implemented."
) from e
return processors
def _get_reward_model_cls_from_name(name: str) -> type[PreTrainedRewardModel]:
"""Get reward model class from its registered name using dynamic imports.
This is used as a helper function to import reward models from 3rd party lerobot
plugins.
Args:
name: The name of the reward model.
Returns:
The reward model class corresponding to the given name.
"""
if name not in RewardModelConfig.get_known_choices():
raise ValueError(
f"Unknown reward model name '{name}'. "
f"Available reward models: {RewardModelConfig.get_known_choices()}"
)
config_cls = RewardModelConfig.get_choice_class(name)
config_cls_name = config_cls.__name__
model_name = config_cls_name.removesuffix("Config")
if model_name == config_cls_name:
raise ValueError(
f"The config class name '{config_cls_name}' does not follow the expected naming convention. "
f"Make sure it ends with 'Config'!"
)
cls_name = model_name + "RewardModel"
module_path = config_cls.__module__.replace("configuration_", "modeling_")
module = importlib.import_module(module_path)
reward_cls = getattr(module, cls_name)
return reward_cls
def _make_processors_from_reward_model_config(
config: RewardModelConfig,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[Any, Any]:
"""Create pre- and post-processors from a reward model configuration using dynamic imports.
This is used as a helper function to import processor factories from 3rd party
lerobot reward model plugins.
Args:
config: The reward model configuration object.
dataset_stats: Dataset statistics for normalization.
Returns:
A tuple containing the input (pre-processor) and output (post-processor) pipelines.
"""
reward_type = config.type
function_name = f"make_{reward_type}_pre_post_processors"
module_path = config.__class__.__module__.replace("configuration_", "processor_")
logging.debug(
f"Instantiating reward pre/post processors using function '{function_name}' "
f"from module '{module_path}'"
)
module = importlib.import_module(module_path)
function = getattr(module, function_name)
return function(config, dataset_stats=dataset_stats)
+244
View File
@@ -0,0 +1,244 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import builtins
import logging
import os
from importlib.resources import files
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import TYPE_CHECKING, Any, TypeVar
import packaging
import safetensors
from huggingface_hub import HfApi, ModelCard, ModelCardData, hf_hub_download
from huggingface_hub.constants import SAFETENSORS_SINGLE_FILE
from huggingface_hub.errors import HfHubHTTPError
from safetensors.torch import load_model as load_model_as_safetensor, save_model as save_model_as_safetensor
from torch import Tensor, nn
from lerobot.configs.rewards import RewardModelConfig
from lerobot.utils.hub import HubMixin
if TYPE_CHECKING:
from lerobot.configs.train import TrainPipelineConfig
T = TypeVar("T", bound="PreTrainedRewardModel")
class PreTrainedRewardModel(nn.Module, HubMixin, abc.ABC):
"""Base class for reward models."""
config_class: None
name: None
def __init__(self, config: RewardModelConfig, *inputs, **kwargs):
super().__init__()
if not isinstance(config, RewardModelConfig):
raise ValueError(
f"Parameter config in `{self.__class__.__name__}(config)` should be an instance of class "
"`RewardModelConfig`. To create a model from a pretrained model use "
f"`model = {self.__class__.__name__}.from_pretrained(PRETRAINED_MODEL_NAME)`"
)
self.config = config
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
if not getattr(cls, "config_class", None):
raise TypeError(f"Class {cls.__name__} must define 'config_class'")
if not getattr(cls, "name", None):
raise TypeError(f"Class {cls.__name__} must define 'name'")
def _save_pretrained(self, save_directory: Path) -> None:
self.config._save_pretrained(save_directory)
model_to_save = self.module if hasattr(self, "module") else self
save_model_as_safetensor(model_to_save, str(save_directory / SAFETENSORS_SINGLE_FILE))
@classmethod
def from_pretrained(
cls: builtins.type[T],
pretrained_name_or_path: str | Path,
*,
config: RewardModelConfig | None = None,
force_download: bool = False,
resume_download: bool | None = None,
proxies: dict | None = None,
token: str | bool | None = None,
cache_dir: str | Path | None = None,
local_files_only: bool = False,
revision: str | None = None,
strict: bool = False,
**kwargs,
) -> T:
"""
The reward model is set in evaluation mode by default using `reward.eval()` (dropout modules are
deactivated). To train it, you should first set it back in training mode with `reward.train()`.
"""
if config is None:
config = RewardModelConfig.from_pretrained(
pretrained_name_or_path=pretrained_name_or_path,
force_download=force_download,
resume_download=resume_download,
proxies=proxies,
token=token,
cache_dir=cache_dir,
local_files_only=local_files_only,
revision=revision,
**kwargs,
)
model_id = str(pretrained_name_or_path)
instance = cls(config, **kwargs)
if os.path.isdir(model_id):
print("Loading weights from local directory")
model_file = os.path.join(model_id, SAFETENSORS_SINGLE_FILE)
reward = cls._load_as_safetensor(instance, model_file, config.device or "cpu", strict)
else:
try:
model_file = hf_hub_download(
repo_id=model_id,
filename=SAFETENSORS_SINGLE_FILE,
revision=revision,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
token=token,
local_files_only=local_files_only,
)
reward = cls._load_as_safetensor(instance, model_file, config.device or "cpu", strict)
except HfHubHTTPError as e:
raise FileNotFoundError(
f"{SAFETENSORS_SINGLE_FILE} not found on the HuggingFace Hub in {model_id}"
) from e
reward.to(config.device)
reward.eval()
return reward
@classmethod
def _load_as_safetensor(cls, model: T, model_file: str, map_location: str, strict: bool) -> T:
# Create base kwargs
kwargs = {"strict": strict}
# Add device parameter for newer versions that support it
if packaging.version.parse(safetensors.__version__) >= packaging.version.parse("0.4.3"):
kwargs["device"] = map_location
# Load the model with appropriate kwargs
missing_keys, unexpected_keys = load_model_as_safetensor(model, model_file, **kwargs)
if missing_keys:
logging.warning(f"Missing key(s) when loading model: {missing_keys}")
if unexpected_keys:
logging.warning(f"Unexpected key(s) when loading model: {unexpected_keys}")
# For older versions, manually move to device if needed
if "device" not in kwargs and map_location != "cpu":
logging.warning(
"Loading model weights on other devices than 'cpu' is not supported natively in your version of safetensors."
" This means that the model is loaded on 'cpu' first and then copied to the device."
" This leads to a slower loading time."
" Please update safetensors to version 0.4.3 or above for improved performance."
)
model.to(map_location)
return model
def get_optim_params(self):
"""
Returns the reward-model-specific parameters dict to be passed on to the optimizer.
"""
return self.parameters()
def reset(self) -> None:
"""Reset any internal state."""
pass
@abc.abstractmethod
def compute_reward(self, batch: dict[str, Tensor]) -> Tensor:
"""Compute a scalar reward signal for a batch of observations.
Args:
batch: Dictionary containing at minimum observation tensors.
May also contain "action", "next_observation.*", etc.
Returns:
Tensor of shape ``(batch_size,)`` with reward values.
"""
...
def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict[str, Any]]:
"""Training forward pass — override for trainable reward models."""
raise NotImplementedError(
f"{self.__class__.__name__} is not trainable. Only use compute_reward() for inference."
)
@property
def is_trainable(self) -> bool:
"""Whether this reward model can be trained via ``lerobot-train``.
Trainable reward models override :meth:`forward`; zero-shot models
inherit the base implementation that raises ``NotImplementedError``.
"""
return type(self).forward is not PreTrainedRewardModel.forward
def push_model_to_hub(self, cfg: "TrainPipelineConfig"):
api = HfApi()
repo_id = api.create_repo(
repo_id=self.config.repo_id, private=self.config.private, exist_ok=True
).repo_id
# Push the files to the repo in a single commit
with TemporaryDirectory(ignore_cleanup_errors=True) as tmp:
saved_path = Path(tmp) / repo_id
self.save_pretrained(saved_path) # Calls _save_pretrained and stores model tensors
card = self.generate_model_card(
cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags
)
card.save(str(saved_path / "README.md"))
cfg.save_pretrained(saved_path) # Calls _save_pretrained and stores train config
commit_info = api.upload_folder(
repo_id=repo_id,
repo_type="model",
folder_path=saved_path,
commit_message="Upload reward model weights, train config and readme",
allow_patterns=["*.safetensors", "*.json", "*.yaml", "*.md"],
ignore_patterns=["*.tmp", "*.log"],
)
logging.info(f"Model pushed to {commit_info.repo_url.url}")
def generate_model_card(
self, dataset_repo_id: str, model_type: str, license: str | None, tags: list[str] | None
) -> ModelCard:
card_data = ModelCardData(
license=license or "apache-2.0",
library_name="lerobot",
pipeline_tag="robotics",
tags=list(set(tags or []).union({"robotics", "lerobot", "reward-model", model_type})),
model_name=model_type,
datasets=dataset_repo_id,
)
template_card = (
files("lerobot.templates")
.joinpath("lerobot_rewardmodel_modelcard_template.md")
.read_text(encoding="utf-8")
)
card = ModelCard.from_template(card_data, template_str=template_card)
card.validate()
return card
@@ -1,4 +1,4 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,5 +14,6 @@
from .configuration_sarm import SARMConfig
from .modeling_sarm import SARMRewardModel
from .processor_sarm import make_sarm_pre_post_processors
__all__ = ["SARMConfig", "SARMRewardModel"]
__all__ = ["SARMConfig", "SARMRewardModel", "make_sarm_pre_post_processors"]
@@ -25,18 +25,18 @@ need ~num_frames/30 queries instead of one per frame (~30x speedup).
Usage:
# Full RA-BC computation with visualizations
python src/lerobot/policies/sarm/compute_rabc_weights.py \\
python src/lerobot/rewards/sarm/compute_rabc_weights.py \\
--dataset-repo-id lerobot/aloha_sim_insertion_human \\
--reward-model-path <USER>/sarm_single_uni4
# Faster computation with stride (compute every 5 frames, interpolate the rest)
python src/lerobot/policies/sarm/compute_rabc_weights.py \\
python src/lerobot/rewards/sarm/compute_rabc_weights.py \\
--dataset-repo-id lerobot/aloha_sim_insertion_human \\
--reward-model-path <USER>/sarm_single_uni4 \\
--stride 5
# Visualize predictions only (no RA-BC computation)
python src/lerobot/policies/sarm/compute_rabc_weights.py \\
python src/lerobot/rewards/sarm/compute_rabc_weights.py \\
--dataset-repo-id lerobot/aloha_sim_insertion_human \\
--reward-model-path <USER>/sarm_single_uni4 \\
--visualize-only \\
@@ -58,10 +58,9 @@ import torch
from tqdm import tqdm
from lerobot.datasets import LeRobotDataset
from .modeling_sarm import SARMRewardModel
from .processor_sarm import make_sarm_pre_post_processors
from .sarm_utils import normalize_stage_tau
from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel
from lerobot.rewards.sarm.processor_sarm import make_sarm_pre_post_processors
from lerobot.rewards.sarm.sarm_utils import normalize_stage_tau
def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
@@ -713,12 +712,12 @@ def main():
epilog="""
Examples:
# Full RA-BC computation with visualizations
python src/lerobot/policies/sarm/compute_rabc_weights.py \\
python src/lerobot/rewards/sarm/compute_rabc_weights.py \\
--dataset-repo-id lerobot/aloha_sim_insertion_human \\
--reward-model-path <USER>/sarm_single_uni4
# Visualize predictions only (no RA-BC computation)
python src/lerobot/policies/sarm/compute_rabc_weights.py \\
python src/lerobot/rewards/sarm/compute_rabc_weights.py \\
--dataset-repo-id lerobot/aloha_sim_insertion_human \\
--reward-model-path <USER>/sarm_single_uni4 \\
--visualize-only \\
@@ -1,5 +1,3 @@
#!/usr/bin/env python
# Copyright 2025 Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu
# and The HuggingFace Inc. team. All rights reserved.
#
@@ -22,14 +20,15 @@ Paper: https://arxiv.org/abs/2509.25358
from dataclasses import dataclass, field
from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
from lerobot.configs.rewards import RewardModelConfig
from lerobot.optim import AdamWConfig, CosineDecayWithWarmupSchedulerConfig
from lerobot.utils.constants import OBS_IMAGES, OBS_STATE
@PreTrainedConfig.register_subclass("sarm")
@RewardModelConfig.register_subclass("sarm")
@dataclass
class SARMConfig(PreTrainedConfig):
class SARMConfig(RewardModelConfig):
"""Configuration class for SARM (Stage-Aware Reward Modeling).
Supports three annotation modes:
@@ -110,7 +109,6 @@ class SARMConfig(PreTrainedConfig):
def __post_init__(self):
super().__post_init__()
if self.annotation_mode not in ["single_stage", "dense_only", "dual"]:
raise ValueError(
f"annotation_mode must be 'single_stage', 'dense_only', or 'dual', got {self.annotation_mode}"
@@ -1,5 +1,3 @@
#!/usr/bin/env python
# Copyright 2025 Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu
# and The HuggingFace Inc. team. All rights reserved.
#
@@ -34,14 +32,13 @@ import torch.nn as nn
import torch.nn.functional as F # noqa: N812
from torch import Tensor
from lerobot.utils.constants import OBS_STR
from ..pretrained import PreTrainedPolicy
from .configuration_sarm import SARMConfig
from .sarm_utils import (
from lerobot.rewards.pretrained import PreTrainedRewardModel
from lerobot.rewards.sarm.configuration_sarm import SARMConfig
from lerobot.rewards.sarm.sarm_utils import (
normalize_stage_tau,
pad_state_to_max_dim,
)
from lerobot.utils.constants import OBS_STR
class StageTransformer(nn.Module):
@@ -353,7 +350,7 @@ def gen_stage_emb(num_classes: int, targets: torch.Tensor) -> torch.Tensor:
return stage_onehot
class SARMRewardModel(PreTrainedPolicy):
class SARMRewardModel(PreTrainedRewardModel):
"""
SARM Reward Model for stage-aware task completion rewards.
@@ -471,6 +468,23 @@ class SARMRewardModel(PreTrainedPolicy):
self.subtask_model.to(device)
return self
def compute_reward(self, batch: dict[str, Tensor]) -> Tensor:
"""Compute dense progress reward in [0, 1] from batch.
Expects batch to contain:
- "observation_features" or video embeddings: (B, T, 512)
- "language_embedding" or text embeddings: (B, 512)
- optionally "observation.state": (B, T, state_dim)
"""
text_emb = batch.get("language_embedding", batch.get("text_features"))
video_emb = batch.get("observation_features", batch.get("video_features"))
state = batch.get("observation.state", batch.get("state_features"))
rewards = self.calculate_rewards(text_emb, video_emb, state)
if isinstance(rewards, np.ndarray):
rewards = torch.from_numpy(rewards).float()
return rewards
@torch.no_grad()
def calculate_rewards(
self,
@@ -631,17 +645,9 @@ class SARMRewardModel(PreTrainedPolicy):
return self.parameters()
def reset(self):
"""Required by PreTrainedPolicy but not used for reward models."""
"""SARM has no episode-level state to reset."""
pass
def predict_action_chunk(self, batch: dict[str, Tensor]) -> Tensor:
"""Required by PreTrainedPolicy but not used for reward models."""
raise NotImplementedError("SARM model does not predict action chunks")
def select_action(self, batch: dict[str, Tensor]) -> Tensor:
"""Required by PreTrainedPolicy but not used for SARM."""
raise NotImplementedError("SARM model does not select actions")
def _train_step(
self,
img_emb: torch.Tensor, # (B, N, T, D)
@@ -1,5 +1,3 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -60,16 +58,15 @@ from lerobot.processor import (
policy_action_to_transition,
transition_to_policy_action,
)
from lerobot.types import EnvTransition, PolicyAction, TransitionKey
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
from .configuration_sarm import SARMConfig
from .sarm_utils import (
from lerobot.rewards.sarm.configuration_sarm import SARMConfig
from lerobot.rewards.sarm.sarm_utils import (
apply_rewind_augmentation,
compute_absolute_indices,
find_stage_and_tau,
pad_state_to_max_dim,
)
from lerobot.types import EnvTransition, PolicyAction, TransitionKey
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
class SARMEncodingProcessorStep(ProcessorStep):
@@ -455,7 +452,13 @@ class SARMEncodingProcessorStep(ProcessorStep):
inputs = {k: v.to(self.device) for k, v in inputs.items()}
# Get image embeddings
embeddings = self.clip_model.get_image_features(**inputs).detach().cpu()
# transformers 5.x returns BaseModelOutputWithPooling instead of a plain tensor
output = self.clip_model.get_image_features(**inputs)
if not isinstance(output, torch.Tensor):
output = output.pooler_output
if output is None:
raise ValueError("pooler_output should not be None for CLIP models.")
embeddings = output.detach().cpu()
# Handle single frame case
if embeddings.dim() == 1:
@@ -482,7 +485,13 @@ class SARMEncodingProcessorStep(ProcessorStep):
inputs = self.clip_processor.tokenizer([text], return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(self.device) for k, v in inputs.items()}
text_embedding = self.clip_model.get_text_features(**inputs).detach().cpu()
# transformers 5.x returns BaseModelOutputWithPooling instead of a plain tensor
output = self.clip_model.get_text_features(**inputs)
if not isinstance(output, torch.Tensor):
output = output.pooler_output
if output is None:
raise ValueError("pooler_output should not be None for CLIP models.")
text_embedding = output.detach().cpu()
text_embedding = text_embedding.expand(batch_size, -1)
return text_embedding
@@ -1,5 +1,3 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -14,14 +12,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
RA-BC (Reward-Aligned Behavior Cloning) sample weighting implementation.
This module implements the SampleWeighter protocol for RA-BC training,
which weights training samples based on their task progress as measured
by the SARM reward model.
The weights are computed based on progress deltas:
delta = progress[t + chunk_size] - progress[t]
High-quality samples (positive progress) get higher weights, while
samples with negative progress (going backwards) get zero weight.
See: https://arxiv.org/abs/2509.25358 for the SARM paper.
"""
import logging
from pathlib import Path
from typing import TYPE_CHECKING
import numpy as np
import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from lerobot.utils.import_utils import _pandas_available
from lerobot.utils.sample_weighting import SampleWeighter
if TYPE_CHECKING or _pandas_available:
import pandas as pd
else:
pd = None # type: ignore[assignment]
def resolve_hf_path(path: str | Path) -> Path:
"""Resolve a path that may be a HuggingFace URL (hf://datasets/...) to a local path."""
@@ -34,23 +56,27 @@ def resolve_hf_path(path: str | Path) -> Path:
return Path(path)
class RABCWeights:
class RABCWeights(SampleWeighter):
"""
Load precomputed SARM progress values and compute RA-BC weights during training.
This class implements the SampleWeighter ABC for use with the generic
sample weighting infrastructure in lerobot.
Progress values are loaded from a parquet file (generated by compute_rabc_weights.py).
During training, computes:
- progress_delta = progress[t + chunk_size] - progress[t]
- rabc_weight based on the delta (paper Eq. 8-9)
Args:
progress_path: Path to parquet file with precomputed progress values
chunk_size: Number of frames ahead for computing progress delta
head_mode: Which SARM head to use ("sparse" or "dense")
kappa: Hard threshold for high-quality samples (default: 0.01)
epsilon: Small constant for numerical stability (default: 1e-6)
fallback_weight: Weight to use for frames without valid delta (default: 1.0)
device: Device to return tensors on
progress_path: Path to parquet file with precomputed progress values.
Supports HuggingFace URLs (hf://datasets/...).
chunk_size: Number of frames ahead for computing progress delta.
head_mode: Which SARM head to use ("sparse" or "dense").
kappa: Hard threshold for high-quality samples (default: 0.01).
epsilon: Small constant for numerical stability (default: 1e-6).
fallback_weight: Weight to use for frames without valid delta (default: 1.0).
device: Device to return tensors on.
"""
def __init__(
@@ -61,7 +87,7 @@ class RABCWeights:
kappa: float = 0.01,
epsilon: float = 1e-6,
fallback_weight: float = 1.0,
device: torch.device = None,
device: torch.device | None = None,
):
self.progress_path = resolve_hf_path(progress_path)
self.chunk_size = chunk_size
@@ -87,8 +113,8 @@ class RABCWeights:
logging.info(f"Using progress column: {self.progress_column}")
self.progress_lookup = {}
self.episode_lookup = {}
self.progress_lookup: dict[int, float] = {}
self.episode_lookup: dict[int, int] = {}
for _, row in self.df.iterrows():
global_idx = int(row["index"])
@@ -100,7 +126,7 @@ class RABCWeights:
self.episode_lookup[global_idx] = episode_idx
# Build episode boundaries for delta computation
self.episode_boundaries = {}
self.episode_boundaries: dict[int, dict[str, int]] = {}
for episode_idx in self.df["episode_index"].unique():
ep_df = self.df[self.df["episode_index"] == episode_idx]
self.episode_boundaries[int(episode_idx)] = {
@@ -114,7 +140,7 @@ class RABCWeights:
# Compute global statistics for weight computation
self._compute_global_stats()
def _compute_global_stats(self):
def _compute_global_stats(self) -> None:
"""Compute global mean and std of progress deltas for weight calculation."""
all_deltas = []
@@ -138,8 +164,8 @@ class RABCWeights:
all_deltas.append(delta)
if all_deltas:
self.delta_mean = max(np.mean(all_deltas), 0.0)
self.delta_std = max(np.std(all_deltas), self.epsilon)
self.delta_mean = max(float(np.mean(all_deltas)), 0.0)
self.delta_std = max(float(np.std(all_deltas)), self.epsilon)
logging.info(f"Progress delta stats: mean={self.delta_mean:.4f}, std={self.delta_std:.4f}")
else:
self.delta_mean = 0.0
@@ -157,18 +183,19 @@ class RABCWeights:
4. Compute weight using paper Eq. 8-9
Args:
batch: Training batch containing "index" key with global frame indices
batch: Training batch containing "index" key with global frame indices.
Returns:
Tuple of:
- Weights tensor (batch_size,) normalized to sum to batch_size
- Stats dict with raw_mean_weight, num_zero_weight, num_full_weight
- Weights tensor (batch_size,) normalized to sum to batch_size.
- Stats dict with weighting statistics for logging.
"""
indices = batch.get("index")
if indices is None:
logging.warning("RA-BC: Batch missing 'index' key, using uniform weights")
batch_size = self._get_batch_size(batch)
return torch.ones(batch_size, device=self.device), {"raw_mean_weight": 1.0}
stats = {"mean_weight": 1.0, "num_zero_weight": 0, "num_full_weight": batch_size}
return torch.ones(batch_size, device=self.device), stats
# Convert to list of ints
if isinstance(indices, torch.Tensor):
@@ -183,29 +210,29 @@ class RABCWeights:
delta = self._compute_delta(idx)
deltas.append(delta)
deltas = np.array(deltas, dtype=np.float32)
deltas_array = np.array(deltas, dtype=np.float32)
# Compute weights from deltas
weights = self._compute_weights(deltas)
weights = self._compute_weights(deltas_array)
# Compute stats before normalization for logging
raw_mean_weight = float(np.nanmean(weights))
num_zero_weight = int(np.sum(weights == 0))
num_full_weight = int(np.sum(weights == 1.0))
batch_stats = {
"raw_mean_weight": raw_mean_weight,
"mean_weight": raw_mean_weight,
"num_zero_weight": num_zero_weight,
"num_full_weight": num_full_weight,
}
weights = torch.tensor(weights, device=self.device, dtype=torch.float32)
weights_tensor = torch.tensor(weights, device=self.device, dtype=torch.float32)
# Normalize to sum to batch_size
batch_size = len(weights)
weight_sum = weights.sum() + self.epsilon
weights = weights * batch_size / weight_sum
batch_size = len(weights_tensor)
weight_sum = weights_tensor.sum() + self.epsilon
weights_tensor = weights_tensor * batch_size / weight_sum
return weights, batch_stats
return weights_tensor, batch_stats
def _compute_delta(self, global_idx: int) -> float:
"""Compute progress delta for a single frame."""
@@ -241,7 +268,7 @@ class RABCWeights:
- Final weight: wi = 1{ri > κ} + 1{0 ri κ}˜wi
Returns:
Array of weights
Array of weights.
"""
valid_mask = ~np.isnan(deltas)
@@ -273,12 +300,13 @@ class RABCWeights:
if key in batch:
val = batch[key]
if isinstance(val, (torch.Tensor, np.ndarray)):
return val.shape[0]
return int(val.shape[0])
return 1
def get_stats(self) -> dict:
"""Get statistics."""
"""Get global statistics about the RA-BC weighting."""
return {
"type": "rabc",
"num_frames": len(self.progress_lookup),
"chunk_size": self.chunk_size,
"head_mode": self.head_mode,
@@ -1,5 +1,3 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
+3 -3
View File
@@ -193,15 +193,15 @@ def convert_lerobot_dataset_to_cropped_lerobot_dataset(
fps=int(original_dataset.fps),
root=new_dataset_root,
robot_type=original_dataset.meta.robot_type,
features=original_dataset.meta.info["features"],
features=original_dataset.meta.info.features,
use_videos=len(original_dataset.meta.video_keys) > 0,
)
# Update the metadata for every image key that will be cropped:
# (Here we simply set the shape to be the final resize_size.)
for key in crop_params_dict:
if key in new_dataset.meta.info["features"]:
new_dataset.meta.info["features"][key]["shape"] = [3] + list(resize_size)
if key in new_dataset.meta.info.features:
new_dataset.meta.info.features[key]["shape"] = (3, *resize_size)
# TODO: Directly modify the mp4 video + meta info features, instead of recreating a dataset
prev_episode_index = 0
+18 -3
View File
@@ -68,9 +68,16 @@ class SOFollower(Robot):
@property
def _cameras_ft(self) -> dict[str, tuple]:
return {
cam: (self.config.cameras[cam].height, self.config.cameras[cam].width, 3) for cam in self.cameras
}
features: dict[str, tuple] = {}
for cam in self.cameras:
cam_cfg = self.config.cameras[cam]
features[cam] = (cam_cfg.height, cam_cfg.width, 3)
# Cameras with a depth stream (e.g. RealSense with use_depth=True) also
# emit a 2D depth feature; hw_to_dataset_features routes 2D shapes to
# ``observation.depth.<bare>`` with the depth-map marker.
if getattr(cam_cfg, "use_depth", False):
features[f"{cam}_depth"] = (cam_cfg.height, cam_cfg.width)
return features
@cached_property
def observation_features(self) -> dict[str, type | tuple]:
@@ -190,6 +197,14 @@ class SOFollower(Robot):
dt_ms = (time.perf_counter() - start) * 1e3
logger.debug(f"{self} read {cam_key}: {dt_ms:.1f}ms")
# Cameras with a depth stream populate a sibling ``<cam>_depth`` key
# (consumed by hw_to_dataset_features / build_dataset_frame).
if getattr(self.config.cameras[cam_key], "use_depth", False):
start = time.perf_counter()
obs_dict[f"{cam_key}_depth"] = cam.read_latest_depth()
dt_ms = (time.perf_counter() - start) * 1e3
logger.debug(f"{self} read {cam_key} depth: {dt_ms:.1f}ms")
return obs_dict
@check_if_not_connected
+9 -3
View File
@@ -75,7 +75,7 @@ class SentryStrategyConfig(RolloutStrategyConfig):
# Target video file size in MB for episode rotation. Episodes are
# saved once the estimated video duration would exceed this limit.
# Defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB when set to None.
target_video_file_size_mb: float | None = None
target_video_file_size_mb: int | None = None
@RolloutStrategyConfig.register_subclass("highlight")
@@ -90,7 +90,7 @@ class HighlightStrategyConfig(RolloutStrategyConfig):
"""
ring_buffer_seconds: float = 10.0
ring_buffer_max_memory_mb: float = 1024.0
ring_buffer_max_memory_mb: int = 1024
save_key: str = "s"
push_key: str = "h"
@@ -150,7 +150,7 @@ class DAggerStrategyConfig(RolloutStrategyConfig):
upload_every_n_episodes: int = 5
# Target video file size in MB for episode rotation (record_autonomous
# mode only). Defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB when None.
target_video_file_size_mb: float | None = None
target_video_file_size_mb: int | None = None
input_device: str = "keyboard"
keyboard: DAggerKeyboardConfig = field(default_factory=DAggerKeyboardConfig)
pedal: DAggerPedalConfig = field(default_factory=DAggerPedalConfig)
@@ -209,6 +209,12 @@ class RolloutConfig:
# Rename map for mapping robot/dataset observation keys to policy keys
rename_map: dict[str, str] = field(default_factory=dict)
# Hardware teardown
# When True (default), smoothly interpolate the robot back to the joint
# positions captured at startup before disconnecting. Set to False to
# leave the robot in its final achieved pose at shutdown.
return_to_initial_position: bool = True
# Torch compile
use_torch_compile: bool = False
torch_compile_backend: str = "inductor"
+36 -23
View File
@@ -27,7 +27,7 @@ from threading import Event
import torch
from lerobot.configs import FeatureType, PreTrainedConfig
from lerobot.configs import FeatureType
from lerobot.datasets import (
LeRobotDataset,
aggregate_pipeline_dataset_features,
@@ -43,6 +43,7 @@ from lerobot.processor import (
make_default_processors,
rename_stats,
)
from lerobot.processor.relative_action_processor import RelativeActionsProcessorStep
from lerobot.robots import make_robot_from_config
from lerobot.teleoperators import Teleoperator, make_teleoperator_from_config
from lerobot.utils.feature_utils import combine_feature_dicts, hw_to_dataset_features
@@ -51,6 +52,7 @@ from .configs import BaseStrategyConfig, DAggerStrategyConfig, RolloutConfig
from .inference import (
InferenceEngine,
RTCInferenceConfig,
SyncInferenceConfig,
create_inference_engine,
)
from .robot_wrapper import ThreadSafeRobot
@@ -176,33 +178,26 @@ def build_rollout_context(
policy_config = cfg.policy
policy_class = get_policy_class(policy_config.type)
full_config = PreTrainedConfig.from_pretrained(cfg.policy.pretrained_path)
for attr in ("device", "use_amp"):
if hasattr(cfg.policy, attr) and hasattr(full_config, attr):
cli_val = getattr(cfg.policy, attr)
if cli_val is not None:
setattr(full_config, attr, cli_val)
if hasattr(policy_config, "compile_model"):
policy_config.compile_model = cfg.use_torch_compile
if hasattr(full_config, "compile_model"):
full_config.compile_model = cfg.use_torch_compile
if full_config.type == "vqbet" and cfg.device == "mps":
if policy_config.type == "vqbet" and cfg.device == "mps":
raise NotImplementedError(
"Current implementation of VQBeT does not support `mps` backend. "
"Please use `cpu` or `cuda` backend."
)
if full_config.use_peft:
if policy_config.use_peft:
from peft import PeftConfig, PeftModel
peft_path = cfg.policy.pretrained_path
peft_path = policy_config.pretrained_path
peft_config = PeftConfig.from_pretrained(peft_path)
policy = policy_class.from_pretrained(
pretrained_name_or_path=peft_config.base_model_name_or_path, config=full_config
pretrained_name_or_path=peft_config.base_model_name_or_path, config=policy_config
)
policy = PeftModel.from_pretrained(policy, peft_path, config=peft_config)
else:
policy = policy_class.from_pretrained(cfg.policy.pretrained_path, config=full_config)
policy = policy_class.from_pretrained(policy_config.pretrained_path, config=policy_config)
if is_rtc:
policy.config.rtc_config = cfg.inference.rtc
@@ -257,10 +252,12 @@ def build_rollout_context(
teleop.connect()
logger.info("Teleoperator connected")
# DAgger requires teleop with motor control capabilities (enable_torque,
# disable_torque, write_goal_positions).
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): once Teleoperator motor-control methods are standardised
# (``enable_torque`` / ``disable_torque`` / ``write_goal_positions``), gate
# the DAgger strategy on their presence here and fail fast with a helpful
# message instead of relying on the operator to pre-align the leader by
# hand. See :func:`DAggerStrategy._apply_transition` for the matching
# disabled call sites.
# if isinstance(cfg.strategy, DAggerStrategyConfig) and teleop is not None:
# required_teleop_methods = ("enable_torque", "disable_torque", "write_goal_positions")
# missing = [m for m in required_teleop_methods if not callable(getattr(teleop, m, None))]
@@ -272,10 +269,13 @@ def build_rollout_context(
# )
# --- 4. Features + action-key reconciliation ---------------------
# TODO(Steven): Only `.pos` joint features are used for policy inference — velocity and
# torque channels are observation-only and must be excluded from the state
# and action tensors that the policy sees.
# TODO(Steven):Only ``.pos`` joint features are routed to the policy as state and as the
# action target; velocity and torque channels (when present) are kept in
# the raw observation but excluded from the policy-facing tensors.
all_obs_features = robot.observation_features
# ``observation_features`` values are either a tuple (camera shape) or the
# ``float`` type itself used as a sentinel for scalar motor features —
# see ``dict[str, type | tuple]`` annotation on ``Robot.observation_features``.
observation_features_hw = {
k: v
for k, v in all_obs_features.items()
@@ -308,7 +308,9 @@ def build_rollout_context(
# Validate visual features if no rename_map is active
rename_map = cfg.rename_map
if not rename_map:
expected_visuals = {k for k, v in full_config.input_features.items() if v.type == FeatureType.VISUAL}
expected_visuals = {
k for k, v in policy_config.input_features.items() if v.type == FeatureType.VISUAL
}
provided_visuals = {
f"observation.images.{k}" for k, v in robot.observation_features.items() if isinstance(v, tuple)
}
@@ -353,6 +355,7 @@ def build_rollout_context(
"Use --dataset.repo_id=<user>/rollout_<name> for policy deployment datasets."
)
cfg.dataset.stamp_repo_id()
target_video_mb = getattr(cfg.strategy, "target_video_file_size_mb", None)
dataset = LeRobotDataset.create(
cfg.dataset.repo_id,
cfg.dataset.fps,
@@ -368,6 +371,7 @@ def build_rollout_context(
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
video_files_size_in_mb=target_video_mb,
)
if dataset is not None:
@@ -391,6 +395,15 @@ def build_rollout_context(
},
)
if isinstance(cfg.inference, SyncInferenceConfig) and any(
isinstance(step, RelativeActionsProcessorStep) and step.enabled
for step in getattr(preprocessor, "steps", ())
):
raise NotImplementedError(
"SyncInferenceEngine does not support policies with relative actions for now."
"Use --inference.type=rtc or remove relative action processor steps from the policy pipeline."
)
# --- 7. Inference strategy (needs policy + pre/post + hardware) --
logger.info(
"Creating inference engine (type=%s)...",
+2 -2
View File
@@ -14,8 +14,8 @@
"""Inference engine package — backend-agnostic action production.
Concrete strategies (sync, RTC, ) expose the same small interface so
rollout strategies never branch on the inference backend.
Concrete backends (``sync``, ``rtc``, ...) expose the same small interface so
rollout strategies never branch on which backend is in use.
"""
from .base import InferenceEngine
+8 -7
View File
@@ -15,8 +15,8 @@
"""Inference engine ABC.
Rollout strategies consume actions through this small interface so they
do not need to know whether the inference engine is synchronous, runs in
a background thread (RTC), or comes from an external source.
do not need to know whether inference happens inline on the control thread
or asynchronously in a background thread (RTC).
"""
from __future__ import annotations
@@ -29,9 +29,10 @@ import torch
class InferenceEngine(abc.ABC):
"""Abstract backend for producing actions during rollout.
Subclasses decide whether inference happens inline, in a background
thread, or externally. The contract is minimal so new backends can
be added without touching rollout strategies.
Subclasses decide whether inference happens inline on the control
thread or asynchronously in a background thread. The contract is
minimal so additional backends can be plugged in without touching
rollout strategies.
Lifecycle
---------
@@ -43,8 +44,8 @@ class InferenceEngine(abc.ABC):
-----------------
``get_action(obs_frame)`` return the next action tensor, or
``None`` if none is available (e.g. async queue empty). Sync
backends always compute from ``obs_frame``; async backends may
ignore it (they get observations via ``notify_observation``).
backends always compute from ``obs_frame``; async backends ignore
it (they receive observations via ``notify_observation``).
Optional hooks
--------------
+2 -3
View File
@@ -68,9 +68,8 @@ class SyncInferenceConfig(InferenceEngineConfig):
class RTCInferenceConfig(InferenceEngineConfig):
"""Real-Time Chunking: async policy inference in a background thread."""
# ``RTCConfig`` is a small dataclass with default-only fields, so eagerly
# constructing one here costs nothing and keeps draccus' CLI surface flat
# (``--inference.rtc.execution_horizon=...`` etc.). No need to lazy-init.
# Eagerly constructed so draccus exposes nested fields directly on the CLI
# (e.g. ``--inference.rtc.execution_horizon=...``).
rtc: RTCConfig = field(default_factory=RTCConfig)
queue_threshold: int = 30
+7 -55
View File
@@ -32,18 +32,14 @@ from typing import Any
import torch
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.policies.rtc import ActionQueue, LatencyTracker
from lerobot.policies.rtc import ActionQueue, LatencyTracker, reanchor_relative_rtc_prefix
from lerobot.policies.rtc.configuration_rtc import RTCConfig
from lerobot.policies.utils import prepare_observation_for_inference
from lerobot.processor import (
NormalizerProcessorStep,
PolicyProcessorPipeline,
RelativeActionsProcessorStep,
TransitionKey,
create_transition,
to_relative_actions,
)
from lerobot.utils.constants import OBS_STATE
from lerobot.utils.feature_utils import build_dataset_frame
from ..robot_wrapper import ThreadSafeRobot
@@ -66,35 +62,6 @@ _RTC_JOIN_TIMEOUT_S: float = 3.0
# ---------------------------------------------------------------------------
def _reanchor_relative_rtc_prefix(
prev_actions_absolute: torch.Tensor,
current_state: torch.Tensor,
relative_step: RelativeActionsProcessorStep,
normalizer_step: NormalizerProcessorStep | None,
policy_device: torch.device | str,
) -> torch.Tensor:
"""Convert absolute leftover actions into model-space for relative-action RTC policies.
When using relative actions, the RTC prefix (previous chunk's unexecuted tail)
is stored in absolute coordinates. Before feeding it back to the policy, this
helper re-expresses those actions relative to the robot's current joint state
and optionally normalizes them so the policy receives correctly scaled inputs.
"""
state = current_state.detach().cpu()
if state.dim() == 1:
state = state.unsqueeze(0)
action_cpu = prev_actions_absolute.detach().cpu()
mask = relative_step._build_mask(action_cpu.shape[-1])
relative_actions = to_relative_actions(action_cpu, state, mask)
transition = create_transition(action=relative_actions)
if normalizer_step is not None:
transition = normalizer_step(transition)
return transition[TransitionKey.ACTION].to(policy_device)
def _normalize_prev_actions_length(prev_actions: torch.Tensor, target_steps: int) -> torch.Tensor:
"""Pad or truncate RTC prefix actions to a fixed length for stable compiled inference."""
if prev_actions.ndim != 2:
@@ -109,21 +76,6 @@ def _normalize_prev_actions_length(prev_actions: torch.Tensor, target_steps: int
return padded
def _get_current_raw_state(
relative_step: RelativeActionsProcessorStep,
fallback_state: torch.Tensor | None,
) -> torch.Tensor | None:
"""Return the current raw state cached by the relative-action step.
``RelativeActionsProcessorStep`` caches the observation state before any
observation normalization. Re-anchoring RTC leftovers must use that raw
state rather than the normalized observation that the policy consumes.
"""
if relative_step._last_state is not None:
return relative_step._last_state
return fallback_state
# ---------------------------------------------------------------------------
# RTCInferenceEngine
# ---------------------------------------------------------------------------
@@ -333,15 +285,15 @@ class RTCInferenceEngine(InferenceEngine):
preprocessed = self._preprocessor(obs_batch)
if prev_actions is not None and self._relative_step is not None:
state_tensor = _get_current_raw_state(
self._relative_step, obs_batch.get(OBS_STATE)
)
if state_tensor is not None:
# Rebase against the raw cached state so the leftover tail stays in
# the training-time coordinate frame.
raw_state = self._relative_step.get_cached_state()
if raw_state is not None:
prev_abs = queue.get_processed_left_over()
if prev_abs is not None and prev_abs.numel() > 0:
prev_actions = _reanchor_relative_rtc_prefix(
prev_actions = reanchor_relative_rtc_prefix(
prev_actions_absolute=prev_abs,
current_state=state_tensor,
current_state=raw_state,
relative_step=self._relative_step,
normalizer_step=self._normalizer_step,
policy_device=policy_device,
+27 -46
View File
@@ -17,28 +17,41 @@
from __future__ import annotations
import logging
from collections import deque
from contextlib import nullcontext
from copy import copy
import torch
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.policies.utils import prepare_observation_for_inference
from lerobot.processor import PolicyProcessorPipeline, RelativeActionsProcessorStep
from lerobot.utils.constants import ACTION
from lerobot.policies.utils import make_robot_action, prepare_observation_for_inference
from lerobot.processor import PolicyProcessorPipeline
from .base import InferenceEngine
logger = logging.getLogger(__name__)
# TODO(Steven): support relative-action policies. The per-tick flow refreshes
# ``RelativeActionsProcessorStep._last_state`` every call, so cached chunk
# actions popped on later ticks get reanchored to the *current* robot state and
# absolute targets drift through the chunk. Relative-action policies are
# rejected at context-build time today; RTC postprocesses the whole chunk and
# is unaffected.
#
# Candidate fix: drive the policy via ``predict_action_chunk`` and serve a
# local FIFO of postprocessed actions. Eliminates drift by construction and
# saves per-tick pre/post work, but bypasses ``select_action`` — needs
# fallbacks for SAC (raises), ACT temporal ensembling (ensembler lives in
# ``select_action``), and Diffusion-family (obs-history queues populated as a
# side effect of ``select_action``).
class SyncInferenceEngine(InferenceEngine):
"""Inline synchronous inference: compute one action per call.
``get_action`` runs the full policy pipeline when its local action
queue is empty, postprocesses the whole predicted chunk immediately,
and then returns one already-postprocessed CPU action at a time.
``get_action`` runs the full policy pipeline (pre/post-processor +
``select_action``) on the given observation frame and returns a
CPU action tensor reordered to match the dataset action keys.
"""
def __init__(
@@ -60,19 +73,6 @@ class SyncInferenceEngine(InferenceEngine):
self._task = task
self._device = torch.device(device or "cpu")
self._robot_type = robot_type
self._processed_action_queue: deque[torch.Tensor] = deque()
self._relative_step = next(
(s for s in preprocessor.steps if isinstance(s, RelativeActionsProcessorStep) and s.enabled),
None,
)
if self._relative_step is not None and self._relative_step.action_names is None:
cfg_names = getattr(policy.config, "action_feature_names", None)
action_names = cfg_names or dataset_features.get(ACTION, {}).get("names")
if action_names:
self._relative_step.action_names = list(action_names)
logger.info("Relative actions enabled: sync chunks will be postprocessed before queueing")
logger.info(
"SyncInferenceEngine initialized (device=%s, action_keys=%d)",
self._device,
@@ -93,29 +93,9 @@ class SyncInferenceEngine(InferenceEngine):
self._policy.reset()
self._preprocessor.reset()
self._postprocessor.reset()
self._processed_action_queue.clear()
def _enqueue_processed_chunk(self, action_chunk: torch.Tensor) -> None:
"""Queue postprocessed per-step actions in policy output order."""
if action_chunk.ndim == 2:
action_chunk = action_chunk.unsqueeze(0)
n_action_steps = getattr(self._policy.config, "n_action_steps", action_chunk.shape[1])
action_chunk = action_chunk[:, : min(n_action_steps, action_chunk.shape[1])]
for action in action_chunk.squeeze(0):
action_tensor = action.detach().cpu()
if len(action_tensor) != len(self._ordered_action_keys):
raise ValueError(
f"Action tensor length ({len(action_tensor)}) != action keys "
f"({len(self._ordered_action_keys)})"
)
self._processed_action_queue.append(action_tensor)
def get_action(self, obs_frame: dict | None) -> torch.Tensor | None:
"""Run the full inference pipeline on ``obs_frame`` and return an action tensor."""
if self._processed_action_queue:
return self._processed_action_queue.popleft().clone()
if obs_frame is None:
return None
# Shallow copy is intentional: the caller (`send_next_action`) builds
@@ -132,10 +112,11 @@ class SyncInferenceEngine(InferenceEngine):
observation, self._device, self._task, self._robot_type
)
observation = self._preprocessor(observation)
action_chunk = self._policy.predict_action_chunk(observation)
processed_chunk = self._postprocessor(action_chunk)
action = self._policy.select_action(observation)
action = self._postprocessor(action)
action_tensor = action.squeeze(0).cpu()
self._enqueue_processed_chunk(processed_chunk)
if not self._processed_action_queue:
return None
return self._processed_action_queue.popleft().clone()
# Reorder to match dataset action ordering so the caller can treat
# the returned tensor uniformly across backends.
action_dict = make_robot_action(action_tensor, self._dataset_features)
return torch.tensor([action_dict[k] for k in self._ordered_action_keys])
+1 -1
View File
@@ -47,7 +47,7 @@ class RolloutRingBuffer:
count.
"""
def __init__(self, max_seconds: float = 30.0, max_memory_mb: float = 2048.0, fps: float = 30.0) -> None:
def __init__(self, max_seconds: float = 30.0, max_memory_mb: int = 2048, fps: float = 30.0) -> None:
self._max_frames = int(max_seconds * fps)
self._max_bytes = int(max_memory_mb * 1024 * 1024)
self._buffer: deque[dict] = deque(maxlen=self._max_frames)
+5 -3
View File
@@ -60,8 +60,7 @@ class BaseStrategy(RolloutStrategy):
break
obs = robot.get_observation()
obs_processed = ctx.processors.robot_observation_processor(obs)
engine.notify_observation(obs_processed)
obs_processed = self._process_observation_and_notify(ctx.processors, obs)
if self._handle_warmup(cfg.use_torch_compile, loop_start, control_interval):
continue
@@ -79,5 +78,8 @@ class BaseStrategy(RolloutStrategy):
def teardown(self, ctx: RolloutContext) -> None:
"""Disconnect hardware and stop inference."""
self._teardown_hardware(ctx.hardware)
self._teardown_hardware(
ctx.hardware,
return_to_initial_position=ctx.runtime.cfg.return_to_initial_position,
)
logger.info("Base strategy teardown complete")
+43 -14
View File
@@ -32,7 +32,7 @@ from ..inference import InferenceEngine
if TYPE_CHECKING:
from ..configs import RolloutStrategyConfig
from ..context import HardwareContext, RolloutContext, RuntimeContext
from ..context import HardwareContext, ProcessorContext, RolloutContext, RuntimeContext
logger = logging.getLogger(__name__)
@@ -50,6 +50,7 @@ class RolloutStrategy(abc.ABC):
self._engine: InferenceEngine | None = None
self._interpolator: ActionInterpolator | None = None
self._warmup_flushed: bool = False
self._cached_obs_processed: dict | None = None
def _init_engine(self, ctx: RolloutContext) -> None:
"""Attach the inference engine and action interpolator, then start the backend.
@@ -65,8 +66,32 @@ class RolloutStrategy(abc.ABC):
self._engine.reset()
self._engine.start()
self._warmup_flushed = False
self._cached_obs_processed = None
logger.info("Inference engine started")
def _process_observation_and_notify(self, processors: ProcessorContext, obs_raw: dict) -> dict:
"""Run the observation processor and notify the engine — throttled to policy ticks.
Callers are responsible for calling ``robot.get_observation()`` every loop
iteration so ``obs_raw`` stays fresh for the action post-processor. This
helper gates only the comparatively expensive bits the processor pipeline
and ``engine.notify_observation`` to fire when the interpolator signals
it needs a new action (once per ``interpolation_multiplier`` ticks). On
interpolated ticks the cached ``obs_processed`` is reused.
With ``interpolation_multiplier == 1`` this is equivalent to the unthrottled
path: ``needs_new_action()`` is True every tick.
The cache is implicitly invalidated whenever ``interpolator.reset()`` is
called (warmup completion, DAgger phase transitions back to AUTONOMOUS),
because reset makes ``needs_new_action()`` return True on the next call.
"""
if self._cached_obs_processed is None or self._interpolator.needs_new_action():
obs_processed = processors.robot_observation_processor(obs_raw)
self._engine.notify_observation(obs_processed)
self._cached_obs_processed = obs_processed
return self._cached_obs_processed
def _handle_warmup(self, use_torch_compile: bool, loop_start: float, control_interval: float) -> bool:
"""Handle torch.compile warmup phase.
@@ -91,16 +116,20 @@ class RolloutStrategy(abc.ABC):
engine.resume()
return False
def _teardown_hardware(self, hw: HardwareContext) -> None:
"""Stop the inference engine, return robot to initial position, and disconnect hardware."""
def _teardown_hardware(self, hw: HardwareContext, return_to_initial_position: bool = True) -> None:
"""Stop the inference engine, optionally return robot to initial position, and disconnect hardware."""
if self._engine is not None:
logger.info("Stopping inference engine...")
self._engine.stop()
robot = hw.robot_wrapper.inner
if robot.is_connected:
if hw.initial_position:
if return_to_initial_position and hw.initial_position:
logger.info("Returning robot to initial position before shutdown...")
self._return_to_initial_position(hw)
elif not return_to_initial_position:
logger.info(
"Skipping return-to-initial-position (disabled by config); leaving robot in final pose."
)
logger.info("Disconnecting robot...")
robot.disconnect()
teleop = hw.teleop
@@ -194,7 +223,7 @@ def estimate_max_episode_seconds(
The estimate ignores codec-specific settings (CRF, preset) on purpose:
we only need a rough lower bound on bitrate, not a precise prediction.
Falls back to 600 s (10 min) when no video features are present.
Falls back to 300 s (5 min) when no video features are present.
"""
# 0.1 bits-per-pixel is a *low* estimate for CRF-30 streaming video of
# robot footage (real-world is typically 0.1 0.3 bpp). Under-
@@ -208,16 +237,16 @@ def estimate_max_episode_seconds(
if feat.get("dtype") == "video":
shape = feat.get("shape", ())
# Assuming shape could be (C, H, W) or (T, C, H, W)
# We want to extract the spatial dimensions.
if len(shape) >= 3:
h, w = shape[-2], shape[-1]
pixels = h * w
if pixels > 0:
camera_pixels.append(pixels)
# (H, W, C) — bits-per-pixel is a per-spatial-pixel metric,
# so we exclude the channel dimension from the count.
if len(shape) == 3:
pixels = shape[0] * shape[1]
camera_pixels.append(pixels)
else:
raise ValueError(f"Unexpected video feature shape: {shape}")
if not camera_pixels:
return 600.0
return 300.0
# Use the smallest camera: it produces the lowest bitrate and therefore
# takes the longest to reach the target — the conservative choice.
@@ -227,7 +256,7 @@ def estimate_max_episode_seconds(
# Guard against division by zero just in case
if bytes_per_second <= 0:
return 600.0
return 300.0
return (target_size_mb * 1024 * 1024) / bytes_per_second
+51 -31
View File
@@ -24,14 +24,21 @@ the ``input_device`` config field. Each device exposes three actions:
1. **pause_resume** Toggle policy execution (AUTONOMOUS <-> PAUSED).
2. **correction** Toggle correction recording (PAUSED <-> CORRECTING).
3. **upload** Push dataset to hub on demand (corrections-only mode).
ESC (keyboard only) Stop session.
ESC (keyboard only) Stop session.
Recording Modes:
Recording modes:
``record_autonomous=True``: Sentry-like continuous recording with
time-based episode rotation. Both autonomous and correction
frames are recorded; corrections tagged ``intervention=True``.
``record_autonomous=False``: Only correction windows are recorded.
Each correction (start to stop) becomes one episode.
Teleoperator expectations:
The user is responsible for keeping the leader arm aligned with the
follower arm at the moment a correction begins. Programmatic motor
handover (``enable_torque`` / ``disable_torque`` / ``write_goal_positions``)
is intentionally not invoked here see the TODO in
:func:`DAggerStrategy._apply_transition` for the open design decision.
"""
from __future__ import annotations
@@ -168,8 +175,10 @@ class DAggerEvents:
# ---------------------------------------------------------------------------
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable programmatic teleop alignment once we decide whether
# to enforce motor-control methods on every Teleoperator. Until then the user
# is responsible for moving the leader arm to the follower's pose at the moment
# a correction begins.
def _teleop_smooth_move_to(
teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 50
) -> None:
@@ -371,7 +380,10 @@ class DAggerStrategy(RolloutStrategy):
logger.info("Dataset uploaded to hub")
log_say("Dataset uploaded to hub", play_sounds)
self._teardown_hardware(ctx.hardware)
self._teardown_hardware(
ctx.hardware,
return_to_initial_position=ctx.runtime.cfg.return_to_initial_position,
)
logger.info("DAgger strategy teardown complete")
# ------------------------------------------------------------------
@@ -403,8 +415,8 @@ class DAggerStrategy(RolloutStrategy):
engine.reset()
interpolator.reset()
events.reset()
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable once Teleoperator motor-control methods are
# standardised; until then the user pre-aligns the leader by hand.
# teleop.disable_torque()
engine.resume()
@@ -434,19 +446,22 @@ class DAggerStrategy(RolloutStrategy):
phase = events.phase
obs = robot.get_observation()
obs_processed = ctx.processors.robot_observation_processor(obs)
obs_frame = build_dataset_frame(features, obs_processed, prefix=OBS_STR)
# --- CORRECTING: human teleop control ---
# TODO(Steven): teleop runs at the same FPS as the policy. To
# decouple the two, sample teleop at its native rate and
# interpolate to the control loop's tick rate.
if phase == DAggerPhase.CORRECTING:
obs_processed = ctx.processors.robot_observation_processor(obs)
teleop_action = teleop.get_action()
processed_teleop = ctx.processors.teleop_action_processor((teleop_action, obs))
robot_action_to_send = ctx.processors.robot_action_processor((processed_teleop, obs))
robot.send_action(robot_action_to_send)
last_action = robot_action_to_send
self._log_telemetry(obs_processed, processed_teleop, ctx.runtime)
action_frame = build_dataset_frame(features, processed_teleop, prefix=ACTION)
if record_tick % record_stride == 0:
obs_frame = build_dataset_frame(features, obs_processed, prefix=OBS_STR)
action_frame = build_dataset_frame(features, processed_teleop, prefix=ACTION)
frame = {
**obs_frame,
**action_frame,
@@ -463,7 +478,7 @@ class DAggerStrategy(RolloutStrategy):
# --- AUTONOMOUS: policy control ---
else:
engine.notify_observation(obs_processed)
obs_processed = self._process_observation_and_notify(ctx.processors, obs)
if self._handle_warmup(cfg.use_torch_compile, loop_start, control_interval):
continue
@@ -472,8 +487,9 @@ class DAggerStrategy(RolloutStrategy):
if action_dict is not None:
self._log_telemetry(obs_processed, action_dict, ctx.runtime)
last_action = ctx.processors.robot_action_processor((action_dict, obs))
action_frame = build_dataset_frame(features, action_dict, prefix=ACTION)
if record_tick % record_stride == 0:
obs_frame = build_dataset_frame(features, obs_processed, prefix=OBS_STR)
action_frame = build_dataset_frame(features, action_dict, prefix=ACTION)
frame = {
**obs_frame,
**action_frame,
@@ -483,9 +499,9 @@ class DAggerStrategy(RolloutStrategy):
dataset.add_frame(frame)
record_tick += 1
# Episode rotation derived from video file-size target.
# Do NOT save mid-correction — wait for the correction
# to finish so the episode boundary is clean.
# Episode rotation derived from the video file-size target.
# Saving is deferred while a correction is ongoing so the
# episode boundary lands on a clean autonomous frame.
elapsed = time.perf_counter() - episode_start
if elapsed >= episode_duration_s and phase != DAggerPhase.CORRECTING:
with self._episode_lock:
@@ -516,8 +532,8 @@ class DAggerStrategy(RolloutStrategy):
finally:
logger.info("DAgger continuous control loop ended — pausing engine")
engine.pause()
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
with contextlib.suppress(Exception):
with self._episode_lock:
@@ -554,8 +570,8 @@ class DAggerStrategy(RolloutStrategy):
engine.reset()
interpolator.reset()
events.reset()
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable once Teleoperator motor-control methods are
# standardised; until then the user pre-aligns the leader by hand.
# teleop.disable_torque()
engine.resume()
@@ -608,10 +624,13 @@ class DAggerStrategy(RolloutStrategy):
phase = events.phase
obs = robot.get_observation()
obs_processed = ctx.processors.robot_observation_processor(obs)
# --- CORRECTING: human teleop control + recording ---
# TODO(Steven): teleop runs at the same FPS as the policy. To
# decouple the two, sample teleop at its native rate and
# interpolate to the control loop's tick rate.
if phase == DAggerPhase.CORRECTING:
obs_processed = ctx.processors.robot_observation_processor(obs)
teleop_action = teleop.get_action()
processed_teleop = ctx.processors.teleop_action_processor((teleop_action, obs))
robot_action_to_send = ctx.processors.robot_action_processor((processed_teleop, obs))
@@ -619,9 +638,9 @@ class DAggerStrategy(RolloutStrategy):
last_action = robot_action_to_send
self._log_telemetry(obs_processed, processed_teleop, ctx.runtime)
obs_frame = build_dataset_frame(features, obs_processed, prefix=OBS_STR)
action_frame = build_dataset_frame(features, processed_teleop, prefix=ACTION)
if record_tick % record_stride == 0:
obs_frame = build_dataset_frame(features, obs_processed, prefix=OBS_STR)
action_frame = build_dataset_frame(features, processed_teleop, prefix=ACTION)
dataset.add_frame(
{
**obs_frame,
@@ -639,7 +658,7 @@ class DAggerStrategy(RolloutStrategy):
# --- AUTONOMOUS: policy control (no recording) ---
else:
engine.notify_observation(obs_processed)
obs_processed = self._process_observation_and_notify(ctx.processors, obs)
if self._handle_warmup(cfg.use_torch_compile, loop_start, control_interval):
continue
@@ -660,8 +679,8 @@ class DAggerStrategy(RolloutStrategy):
finally:
logger.info("DAgger corrections-only loop ended — pausing engine")
engine.pause()
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
with contextlib.suppress(Exception):
with self._episode_lock:
@@ -691,15 +710,16 @@ class DAggerStrategy(RolloutStrategy):
_robot_pos = {
k: v for k, v in obs.items() if k.endswith(".pos") and k in robot.observation_features
}
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# Consider also a method that moves the robot to the teleop smoothly (similar to what we do at HW shutdown).
# _teleop_smooth_move_to(teleop, robot_pos, duration_s=2.0, fps=50)
# TODO(Steven): once Teleoperator motor-control methods are
# standardised, drive the leader to the follower's pose here so the
# operator does not need to pre-align the arm by hand. Until then
# the user is responsible for the alignment.
# _teleop_smooth_move_to(teleop, _robot_pos, duration_s=2.0, fps=50)
elif new_phase == DAggerPhase.CORRECTING:
logger.info("Entering correction mode — human teleop control")
# TODO(Steven): either enforce this (meaning all teleop must implement these methods) or
# user is responsible for moving the teleop to the same position as the robot when starting the correction.
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
elif new_phase == DAggerPhase.AUTONOMOUS:
+7 -5
View File
@@ -64,8 +64,8 @@ class HighlightStrategy(RolloutStrategy):
3. The episode is saved and the ring buffer resumes capturing.
Requires ``streaming_encoding=True`` (enforced in config validation)
so that ``dataset.add_frame`` is a non-blocking queue put draining
900 frames stays sub-ms per frame.
so that ``dataset.add_frame`` is a non-blocking queue put flushing
the entire ring buffer in one tick must not stall the control loop.
"""
config: HighlightStrategyConfig
@@ -135,8 +135,7 @@ class HighlightStrategy(RolloutStrategy):
break
obs = robot.get_observation()
obs_processed = ctx.processors.robot_observation_processor(obs)
engine.notify_observation(obs_processed)
obs_processed = self._process_observation_and_notify(ctx.processors, obs)
if self._handle_warmup(cfg.use_torch_compile, loop_start, control_interval):
continue
@@ -228,7 +227,10 @@ class HighlightStrategy(RolloutStrategy):
logger.info("Dataset uploaded to hub")
log_say("Dataset uploaded to hub", play_sounds)
self._teardown_hardware(ctx.hardware)
self._teardown_hardware(
ctx.hardware,
return_to_initial_position=ctx.runtime.cfg.return_to_initial_position,
)
logger.info("Highlight strategy teardown complete")
def _setup_keyboard(self, shutdown_event: ThreadingEvent) -> None:
+5 -3
View File
@@ -111,8 +111,7 @@ class SentryStrategy(RolloutStrategy):
break
obs = robot.get_observation()
obs_processed = ctx.processors.robot_observation_processor(obs)
engine.notify_observation(obs_processed)
obs_processed = self._process_observation_and_notify(ctx.processors, obs)
if self._handle_warmup(cfg.use_torch_compile, loop_start, control_interval):
continue
@@ -197,7 +196,10 @@ class SentryStrategy(RolloutStrategy):
logger.info("Dataset uploaded to hub")
log_say("Dataset uploaded to hub", play_sounds)
self._teardown_hardware(ctx.hardware)
self._teardown_hardware(
ctx.hardware,
return_to_initial_position=ctx.runtime.cfg.return_to_initial_position,
)
logger.info("Sentry strategy teardown complete")
def _background_push(self, dataset, cfg) -> None:
@@ -70,6 +70,7 @@ from lerobot.datasets.io_utils import (
get_parquet_file_size_in_mb,
get_parquet_num_frames,
load_info,
load_json,
write_episodes,
write_info,
write_stats,
@@ -81,9 +82,11 @@ from lerobot.datasets.utils import (
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
INFO_PATH,
LEGACY_EPISODES_PATH,
LEGACY_EPISODES_STATS_PATH,
LEGACY_TASKS_PATH,
DatasetInfo,
update_chunk_file_indices,
)
from lerobot.datasets.video_utils import concatenate_video_files, get_video_duration_in_s
@@ -165,7 +168,7 @@ def legacy_load_tasks(local_dir: Path) -> tuple[dict, dict]:
def validate_local_dataset_version(local_path: Path) -> None:
"""Validate that the local dataset has the expected v2.1 version."""
info = load_info(local_path)
dataset_version = info.get("codebase_version", "unknown")
dataset_version = info.codebase_version or "unknown"
if dataset_version != V21:
raise ValueError(
f"Local dataset has codebase version '{dataset_version}', expected '{V21}'. "
@@ -256,14 +259,14 @@ def convert_data(root: Path, new_root: Path, data_file_size_in_mb: int):
def get_video_keys(root):
info = load_info(root)
features = info["features"]
features = info.features
video_keys = [key for key, ft in features.items() if ft["dtype"] == "video"]
return video_keys
def get_image_keys(root):
info = load_info(root)
features = info["features"]
features = info.features
image_keys = [key for key, ft in features.items() if ft["dtype"] == "image"]
return image_keys
@@ -434,7 +437,8 @@ def convert_episodes_metadata(root, new_root, episodes_metadata, episodes_video_
def convert_info(root, new_root, data_file_size_in_mb, video_file_size_in_mb):
info = load_info(root)
# Load as raw dict to remove legacy v2.1 fields before constructing DatasetInfo.
info = load_json(root / INFO_PATH)
info["codebase_version"] = V30
del info["total_chunks"]
del info["total_videos"]
@@ -449,7 +453,9 @@ def convert_info(root, new_root, data_file_size_in_mb, video_file_size_in_mb):
# already has fps in video_info
continue
info["features"][key]["fps"] = info["fps"]
write_info(info, new_root)
# Convert raw dict to typed DatasetInfo before writing
dataset_info = DatasetInfo.from_dict(info)
write_info(dataset_info, new_root)
def convert_dataset(
+93 -18
View File
@@ -49,6 +49,14 @@ Delete episodes and save to a new dataset at a specific path and with a new repo
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]"
Delete episodes and re-encode video segments with h264:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]" \
--operation.camera_encoder_config.vcodec h264 \
--operation.camera_encoder_config.crf 23
Split dataset by fractions (pusht_train, pusht_val):
lerobot-edit-dataset \
--repo_id lerobot/pusht \
@@ -74,6 +82,14 @@ Split into more than two splits:
--operation.type split \
--operation.splits '{"train": 0.6, "val": 0.2, "test": 0.2}'
Split dataset and re-encode video segments with h264:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type split \
--operation.splits '{"train": 0.8, "val": 0.2}' \
--operation.camera_encoder_config.vcodec h264 \
--operation.camera_encoder_config.crf 23
Merge multiple datasets:
lerobot-edit-dataset \
--new_repo_id lerobot/pusht_merged \
@@ -150,11 +166,24 @@ Show dataset information without feature details:
--operation.type info \
--operation.show_features false
Recompute dataset statistics:
Recompute dataset statistics (saves to lerobot/pusht_recomputed_stats by default):
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type recompute_stats
Recompute stats and save to a specific new repo_id:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--new_repo_id lerobot/pusht_new_stats \
--operation.type recompute_stats
Recompute stats in-place (overwrites original dataset stats):
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--new_repo_id lerobot/pusht \
--operation.type recompute_stats \
--operation.overwrite true
Recompute stats for relative actions and push to hub:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
@@ -174,7 +203,7 @@ import abc
import logging
import shutil
import sys
from dataclasses import dataclass
from dataclasses import dataclass, field
from pathlib import Path
import draccus
@@ -182,6 +211,8 @@ import draccus
from lerobot.configs import parser
from lerobot.datasets import (
LeRobotDataset,
VideoEncoderConfig,
camera_encoder_defaults,
convert_image_to_video_dataset,
delete_episodes,
merge_datasets,
@@ -205,12 +236,14 @@ class OperationConfig(draccus.ChoiceRegistry, abc.ABC):
@dataclass
class DeleteEpisodesConfig(OperationConfig):
episode_indices: list[int] | None = None
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
@OperationConfig.register_subclass("split")
@dataclass
class SplitConfig(OperationConfig):
splits: dict[str, float | list[int]] | None = None
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
@OperationConfig.register_subclass("merge")
@@ -237,11 +270,7 @@ class ModifyTasksConfig(OperationConfig):
@dataclass
class ConvertImageToVideoConfig(OperationConfig):
output_dir: str | None = None
vcodec: str = "libsvtav1"
pix_fmt: str = "yuv420p"
g: int = 2
crf: int = 30
fast_decode: int = 0
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
episode_indices: list[int] | None = None
num_workers: int = 4
max_episodes_per_batch: int | None = None
@@ -256,6 +285,7 @@ class RecomputeStatsConfig(OperationConfig):
relative_exclude_joints: list[str] | None = None
chunk_size: int = 50
num_workers: int = 0
overwrite: bool = False
@OperationConfig.register_subclass("info")
@@ -280,16 +310,30 @@ class EditDatasetConfig:
push_to_hub: bool = False
def _resolve_io_paths(
repo_id: str,
new_repo_id: str | None,
root: Path | str | None,
new_root: Path | str | None,
default_new_repo_id: str | None = None,
) -> tuple[str, Path, Path]:
"""Resolve input/output paths and repo_id for dataset operations.
Returns (output_repo_id, input_path, output_path) with resolved (symlink-safe) paths.
"""
input_path = (Path(root) if root else HF_LEROBOT_HOME / repo_id).resolve()
output_repo_id = new_repo_id or default_new_repo_id or repo_id
output_path = (Path(new_root) if new_root else HF_LEROBOT_HOME / output_repo_id).resolve()
return output_repo_id, input_path, output_path
def get_output_path(
repo_id: str,
new_repo_id: str | None,
root: Path | str | None,
new_root: Path | str | None,
) -> tuple[str, Path]:
input_path = Path(root) if root else HF_LEROBOT_HOME / repo_id
output_repo_id = new_repo_id if new_repo_id else repo_id
output_path = Path(new_root) if new_root else HF_LEROBOT_HOME / output_repo_id
output_repo_id, input_path, output_path = _resolve_io_paths(repo_id, new_repo_id, root, new_root)
# In case of in-place modification, create a backup of the original dataset (if it exists)
if output_path == input_path:
@@ -328,6 +372,7 @@ def handle_delete_episodes(cfg: EditDatasetConfig) -> None:
episode_indices=cfg.operation.episode_indices,
output_dir=output_dir,
repo_id=output_repo_id,
camera_encoder_config=cfg.operation.camera_encoder_config,
)
logging.info(f"Dataset saved to {output_dir}")
@@ -359,6 +404,7 @@ def handle_split(cfg: EditDatasetConfig) -> None:
dataset,
splits=cfg.operation.splits,
output_dir=cfg.new_root,
camera_encoder_config=cfg.operation.camera_encoder_config,
)
for split_name, split_ds in split_datasets.items():
@@ -529,11 +575,8 @@ def handle_convert_image_to_video(cfg: EditDatasetConfig) -> None:
dataset=dataset,
output_dir=output_dir,
repo_id=output_repo_id,
vcodec=getattr(cfg.operation, "vcodec", "libsvtav1"),
pix_fmt=getattr(cfg.operation, "pix_fmt", "yuv420p"),
g=getattr(cfg.operation, "g", 2),
crf=getattr(cfg.operation, "crf", 30),
fast_decode=getattr(cfg.operation, "fast_decode", 0),
camera_encoder_config=getattr(cfg.operation, "camera_encoder_config", None)
or camera_encoder_defaults(),
episode_indices=getattr(cfg.operation, "episode_indices", None),
num_workers=getattr(cfg.operation, "num_workers", 4),
max_episodes_per_batch=getattr(cfg.operation, "max_episodes_per_batch", None),
@@ -557,7 +600,39 @@ def handle_recompute_stats(cfg: EditDatasetConfig) -> None:
if not isinstance(cfg.operation, RecomputeStatsConfig):
raise ValueError("Operation config must be RecomputeStatsConfig")
dataset = LeRobotDataset(cfg.repo_id, root=cfg.root)
# Determine whether this is an in-place operation
output_repo_id, input_root, output_root = _resolve_io_paths(
cfg.repo_id,
cfg.new_repo_id,
cfg.root,
cfg.new_root,
default_new_repo_id=f"{cfg.repo_id}_recomputed_stats",
)
in_place = output_root == input_root
if in_place and not cfg.operation.overwrite:
raise ValueError(
f"recompute_stats would overwrite the dataset in-place at {input_root}. "
"Pass --operation.overwrite true to allow in-place modification, "
"or use --new_repo_id / --new_root to write to a different location. "
f"Default output repo_id when neither is set: '{cfg.repo_id}_recomputed_stats'."
)
if in_place:
logging.warning(
f"Overwriting dataset stats in-place at {input_root}. The original stats will be lost."
)
dataset = LeRobotDataset(cfg.repo_id, root=input_root)
else:
logging.info(f"Copying dataset from {input_root} to {output_root}")
if output_root.exists():
backup_path = output_root.with_name(output_root.name + "_old")
logging.warning(f"Output directory {output_root} already exists. Moving to {backup_path}")
if backup_path.exists():
shutil.rmtree(backup_path)
shutil.move(output_root, backup_path)
shutil.copytree(input_root, output_root)
dataset = LeRobotDataset(output_repo_id, root=output_root)
logging.info(f"Recomputing stats for {cfg.repo_id}")
if cfg.operation.relative_action:
@@ -578,7 +653,7 @@ def handle_recompute_stats(cfg: EditDatasetConfig) -> None:
logging.info(f"Stats written to {dataset.root}")
if cfg.push_to_hub:
logging.info(f"Pushing to hub as {dataset.meta.repo_id}...")
logging.info(f"Pushing to hub as {dataset.repo_id}...")
dataset.push_to_hub()
+36 -7
View File
@@ -63,6 +63,27 @@ lerobot-record \\
--dataset.streaming_encoding=true \\
--dataset.encoder_threads=2
```
Example recording with custom video encoding parameters:
```shell
lerobot-record \\
--robot.type=so100_follower \\
--robot.port=/dev/tty.usbmodem58760431541 \\
--robot.cameras="{laptop: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \\
--robot.id=black \\
--teleop.type=so100_leader \\
--teleop.port=/dev/tty.usbmodem58760431551 \\
--teleop.id=blue \\
--dataset.repo_id=<my_username>/<my_dataset_name> \\
--dataset.num_episodes=2 \\
--dataset.single_task="Grab the cube" \\
--dataset.streaming_encoding=true \\
--dataset.encoder_threads=2 \\
--dataset.camera_encoder_config.vcodec=h264 \\
--dataset.camera_encoder_config.preset=fast \\
--dataset.camera_encoder_config.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \\
--display_data=true
```
"""
import logging
@@ -83,10 +104,12 @@ from lerobot.common.control_utils import (
from lerobot.configs import parser
from lerobot.configs.dataset import DatasetRecordConfig
from lerobot.datasets import (
DepthEncoderConfig,
LeRobotDataset,
VideoEncodingManager,
aggregate_pipeline_dataset_features,
create_initial_features,
depth_encoder_defaults,
safe_stop_image_writer,
)
from lerobot.processor import (
@@ -305,7 +328,10 @@ def record_loop(
if display_data:
log_rerun_data(
observation=obs_processed, action=action_values, compress_images=display_compressed_images
observation=obs_processed,
action=action_values,
compress_images=display_compressed_images,
features=dataset.features if dataset is not None else None,
)
dt_s = time.perf_counter() - start_loop_t
@@ -377,10 +403,11 @@ def record(
cfg.dataset.repo_id,
root=cfg.dataset.root,
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
vcodec=cfg.dataset.vcodec,
camera_encoder_config=cfg.dataset.camera_encoder_config,
depth_encoder_config=cfg.dataset.depth_encoder_config,
encoder_threads=cfg.dataset.encoder_threads,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
image_writer_processes=cfg.dataset.num_image_writer_processes if num_cameras > 0 else 0,
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * num_cameras
if num_cameras > 0
@@ -389,7 +416,8 @@ def record(
sanity_check_dataset_robot_compatibility(dataset, robot, cfg.dataset.fps, dataset_features)
else:
# Reject eval_ prefix — for policy evaluation use lerobot-rollout
if cfg.dataset.repo_id.startswith("eval_"):
repo_name = cfg.dataset.repo_id.split("/", 1)[-1]
if repo_name.startswith("eval_"):
raise ValueError(
"Dataset names starting with 'eval_' are reserved for policy evaluation. "
"lerobot-record is for data collection only. Use lerobot-rollout for policy deployment."
@@ -405,10 +433,11 @@ def record(
image_writer_processes=cfg.dataset.num_image_writer_processes,
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * len(robot.cameras),
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
vcodec=cfg.dataset.vcodec,
camera_encoder_config=cfg.dataset.camera_encoder_config,
depth_encoder_config=cfg.dataset.depth_encoder_config,
encoder_threads=cfg.dataset.encoder_threads,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
)
robot.connect()
@@ -419,7 +448,7 @@ def record(
if not cfg.dataset.streaming_encoding:
logging.info(
"Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
"Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.camera_encoder_config.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
)
with VideoEncodingManager(dataset):
+95 -82
View File
@@ -47,6 +47,7 @@ from lerobot.datasets import EpisodeAwareSampler, make_dataset
from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
from lerobot.optim.factory import make_optimizer_and_scheduler
from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
from lerobot.rewards import make_reward_pre_post_processors
from lerobot.utils.import_utils import register_third_party_plugins
from lerobot.utils.logging_utils import AverageMeter, MetricsTracker
from lerobot.utils.random_utils import set_seed
@@ -70,8 +71,8 @@ def update_policy(
accelerator: "Accelerator",
lr_scheduler=None,
lock=None,
rabc_weights_provider=None,
) -> tuple[MetricsTracker, dict]:
sample_weighter=None,
) -> tuple[MetricsTracker, dict | None]:
"""
Performs a single training step to update the policy's weights.
@@ -87,7 +88,7 @@ def update_policy(
accelerator: The Accelerator instance for distributed training and mixed precision.
lr_scheduler: An optional learning rate scheduler.
lock: An optional lock for thread-safe optimizer updates.
rabc_weights_provider: Optional RABCWeights instance for sample weighting.
sample_weighter: Optional SampleWeighter instance for per-sample loss weighting.
Returns:
A tuple containing:
@@ -97,27 +98,31 @@ def update_policy(
start_time = time.perf_counter()
policy.train()
# Get RA-BC weights if enabled
rabc_batch_weights = None
rabc_batch_stats = None
if rabc_weights_provider is not None:
rabc_batch_weights, rabc_batch_stats = rabc_weights_provider.compute_batch_weights(batch)
# Compute sample weights if a weighter is provided
sample_weights = None
weight_stats = None
if sample_weighter is not None:
sample_weights, weight_stats = sample_weighter.compute_batch_weights(batch)
# Let accelerator handle mixed precision
with accelerator.autocast():
# Use per-sample loss when RA-BC is enabled for proper weighting
if rabc_batch_weights is not None:
# Get per-sample losses
if sample_weights is not None:
# Use per-sample loss for weighted training
# Note: Policies supporting sample weighting must implement forward(batch, reduction="none")
per_sample_loss, output_dict = policy.forward(batch, reduction="none")
# Apply RA-BC weights: L_RA-BC = Σ(w_i * l_i) / (Σw_i + ε)
# rabc_batch_weights is already normalized to sum to batch_size
# Weighted loss: each sample's contribution is scaled by its weight.
# We divide by weight sum (not batch size) so that if some weights are zero,
# the remaining samples contribute proportionally more, preserving gradient scale.
# Weights are pre-normalized to sum to batch_size for stable training dynamics.
epsilon = 1e-6
loss = (per_sample_loss * rabc_batch_weights).sum() / (rabc_batch_weights.sum() + epsilon)
# Log raw mean weight (before normalization) - this is the meaningful metric
output_dict["rabc_mean_weight"] = rabc_batch_stats["raw_mean_weight"]
output_dict["rabc_num_zero_weight"] = rabc_batch_stats["num_zero_weight"]
output_dict["rabc_num_full_weight"] = rabc_batch_stats["num_full_weight"]
loss = (per_sample_loss * sample_weights).sum() / (sample_weights.sum() + epsilon)
# Log weighting statistics
if output_dict is None:
output_dict = {}
for key, value in weight_stats.items():
output_dict[f"sample_weight_{key}"] = value
else:
loss, output_dict = policy.forward(batch)
@@ -188,8 +193,8 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
# Accelerate auto-detects the device based on the available hardware and ignores the policy.device setting.
# Force the device to be CPU when policy.device is set to CPU.
force_cpu = cfg.policy.device == "cpu"
# Force the device to be CPU when the active config's device is set to CPU (works for both policy and reward model training).
force_cpu = cfg.trainable_config.device == "cpu"
accelerator = Accelerator(
step_scheduler_with_optimizer=False,
kwargs_handlers=[ddp_kwargs],
@@ -245,26 +250,44 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
logging.info("Creating env")
eval_env = make_env(cfg.env, n_envs=cfg.eval.batch_size, use_async_envs=cfg.eval.use_async_envs)
if is_main_process:
logging.info("Creating policy")
policy = make_policy(
cfg=cfg.policy,
ds_meta=dataset.meta,
rename_map=cfg.rename_map,
)
if cfg.is_reward_model_training:
if is_main_process:
logging.info("Creating reward model")
from lerobot.rewards import make_reward_model
policy = make_reward_model(
cfg=cfg.reward_model,
dataset_stats=dataset.meta.stats,
dataset_meta=dataset.meta,
)
if not policy.is_trainable:
raise ValueError(
f"Reward model '{policy.name}' is zero-shot and cannot be trained via lerobot-train. "
"Use it directly for inference via compute_reward() (e.g. offline precompute)."
)
else:
if is_main_process:
logging.info("Creating policy")
policy = make_policy(
cfg=cfg.policy,
ds_meta=dataset.meta,
rename_map=cfg.rename_map,
)
if cfg.peft is not None:
if cfg.is_reward_model_training:
raise ValueError("PEFT is only supported for policy training. ")
logging.info("Using PEFT! Wrapping model.")
# Convert CLI peft config to dict for overrides
peft_cli_overrides = dataclasses.asdict(cfg.peft)
policy = policy.wrap_with_peft(peft_cli_overrides=peft_cli_overrides)
# Wait for all processes to finish policy creation before continuing
# Wait for all processes to finish model creation before continuing
accelerator.wait_for_everyone()
processor_pretrained_path = cfg.policy.pretrained_path
active_cfg = cfg.trainable_config
processor_pretrained_path = active_cfg.pretrained_path
if (
getattr(cfg.policy, "use_relative_actions", False)
getattr(active_cfg, "use_relative_actions", False)
and processor_pretrained_path is not None
and not cfg.resume
):
@@ -274,18 +297,15 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
)
processor_pretrained_path = None
# Create processors - only provide dataset_stats if not resuming from saved processors
processor_kwargs = {}
postprocessor_kwargs = {}
if (processor_pretrained_path and not cfg.resume) or not processor_pretrained_path:
# Only provide dataset_stats when not resuming from saved processor state
processor_kwargs["dataset_stats"] = dataset.meta.stats
# For SARM, always provide dataset_meta for progress normalization
if cfg.policy.type == "sarm":
if cfg.is_reward_model_training:
processor_kwargs["dataset_meta"] = dataset.meta
if processor_pretrained_path is not None:
if not cfg.is_reward_model_training and processor_pretrained_path is not None:
processor_kwargs["preprocessor_overrides"] = {
"device_processor": {"device": device.type},
"normalizer_processor": {
@@ -305,38 +325,36 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
},
}
preprocessor, postprocessor = make_pre_post_processors(
policy_cfg=cfg.policy,
pretrained_path=processor_pretrained_path,
**processor_kwargs,
**postprocessor_kwargs,
)
if cfg.is_reward_model_training:
preprocessor, postprocessor = make_reward_pre_post_processors(
cfg.reward_model,
**processor_kwargs,
)
else:
preprocessor, postprocessor = make_pre_post_processors(
policy_cfg=cfg.policy,
pretrained_path=processor_pretrained_path,
**processor_kwargs,
**postprocessor_kwargs,
)
if is_main_process:
logging.info("Creating optimizer and scheduler")
optimizer, lr_scheduler = make_optimizer_and_scheduler(cfg, policy)
# Load precomputed SARM progress for RA-BC if enabled
# Generate progress using: src/lerobot/policies/sarm/compute_rabc_weights.py
rabc_weights = None
if cfg.use_rabc:
from lerobot.utils.rabc import RABCWeights
# Create sample weighter if configured (e.g., for RA-BC training)
sample_weighter = None
if cfg.sample_weighting is not None:
from lerobot.utils.sample_weighting import make_sample_weighter
# Get chunk_size from policy config
chunk_size = getattr(policy.config, "chunk_size", None)
if chunk_size is None:
raise ValueError("Chunk size is not found in policy config")
head_mode = getattr(cfg, "rabc_head_mode", "sparse")
logging.info(f"Loading SARM progress for RA-BC from {cfg.rabc_progress_path}")
logging.info(f"Using chunk_size={chunk_size} from policy config, head_mode={head_mode}")
rabc_weights = RABCWeights(
progress_path=cfg.rabc_progress_path,
chunk_size=chunk_size,
head_mode=head_mode,
kappa=getattr(cfg, "rabc_kappa", 0.01),
epsilon=getattr(cfg, "rabc_epsilon", 1e-6),
device=device,
if is_main_process:
logging.info(f"Creating sample weighter: {cfg.sample_weighting.type}")
sample_weighter = make_sample_weighter(
cfg.sample_weighting,
policy,
device,
dataset_root=cfg.dataset.root,
dataset_repo_id=cfg.dataset.repo_id,
)
step = 0 # number of policy updates (forward + backward + optim)
@@ -365,13 +383,13 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")
# create dataloader for offline training
if hasattr(cfg.policy, "drop_n_last_frames"):
if hasattr(active_cfg, "drop_n_last_frames"):
shuffle = False
sampler = EpisodeAwareSampler(
dataset.meta.episodes["dataset_from_index"],
dataset.meta.episodes["dataset_to_index"],
episode_indices_to_use=dataset.episodes,
drop_n_last_frames=cfg.policy.drop_n_last_frames,
drop_n_last_frames=active_cfg.drop_n_last_frames,
shuffle=True,
)
else:
@@ -448,7 +466,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
cfg.optimizer.grad_clip_norm,
accelerator=accelerator,
lr_scheduler=lr_scheduler,
rabc_weights_provider=rabc_weights,
sample_weighter=sample_weighter,
)
# Note: eval and checkpoint happens *after* the `step`th training update has completed, so we
@@ -467,16 +485,10 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
wandb_log_dict = train_tracker.to_dict()
if output_dict:
wandb_log_dict.update(output_dict)
# Log RA-BC statistics if enabled
if rabc_weights is not None:
rabc_stats = rabc_weights.get_stats()
wandb_log_dict.update(
{
"rabc_delta_mean": rabc_stats["delta_mean"],
"rabc_delta_std": rabc_stats["delta_std"],
"rabc_num_frames": rabc_stats["num_frames"],
}
)
# Log sample weighting statistics if enabled
if sample_weighter is not None:
weighter_stats = sample_weighter.get_stats()
wandb_log_dict.update({f"sample_weighting/{k}": v for k, v in weighter_stats.items()})
wandb_logger.log_dict(wandb_log_dict, step)
train_tracker.reset_averages()
@@ -558,14 +570,15 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
if is_main_process:
logging.info("End of training")
if cfg.policy.push_to_hub:
unwrapped_policy = accelerator.unwrap_model(policy)
if cfg.policy.use_peft:
unwrapped_policy.push_model_to_hub(cfg, peft_model=unwrapped_policy)
if getattr(active_cfg, "push_to_hub", False):
unwrapped_model = accelerator.unwrap_model(policy)
# PEFT only applies when training a policy — reward models use the plain path.
if not cfg.is_reward_model_training and cfg.policy.use_peft:
unwrapped_model.push_model_to_hub(cfg, peft_model=unwrapped_model)
else:
unwrapped_policy.push_model_to_hub(cfg)
preprocessor.push_to_hub(cfg.policy.repo_id)
postprocessor.push_to_hub(cfg.policy.repo_id)
unwrapped_model.push_model_to_hub(cfg)
preprocessor.push_to_hub(active_cfg.repo_id)
postprocessor.push_to_hub(active_cfg.repo_id)
# Properly clean up the distributed process group
accelerator.wait_for_everyone()
@@ -0,0 +1,55 @@
---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
# prettier-ignore
{{card_data}}
---
# Reward Model Card for {{ model_name | default("Reward Model ID", true) }}
<!-- Provide a quick summary of what the reward model is/does. -->
{% if model_name == "reward_classifier" %}
A reward classifier is a lightweight neural network that scores observations or trajectories for task success, providing a learned reward signal or offline evaluation when explicit rewards are unavailable.
{% elif model_name == "sarm" %}
A Success-Aware Reward Model (SARM) predicts a dense reward signal from observations, typically used downstream for reinforcement learning or human-in-the-loop fine-tuning when task success is not directly observable.
{% else %}
_Reward model type not recognized — please update this template._
{% endif %}
This reward model has been trained and pushed to the Hub using [LeRobot](https://github.com/huggingface/lerobot).
See the full documentation at [LeRobot Docs](https://huggingface.co/docs/lerobot/index).
---
## How to Get Started with the Reward Model
### Train from scratch
```bash
lerobot-train \
--dataset.repo_id=${HF_USER}/<dataset> \
--reward_model.type={{ model_name | default("reward_classifier", true) }} \
--output_dir=outputs/train/<desired_reward_model_repo_id> \
--job_name=lerobot_reward_training \
--reward_model.device=cuda \
--reward_model.repo_id=${HF_USER}/<desired_reward_model_repo_id> \
--wandb.enable=true
```
_Writes checkpoints to `outputs/train/<desired_reward_model_repo_id>/checkpoints/`._
### Load the reward model in Python
```python
from lerobot.rewards import make_reward_model
reward_model = make_reward_model(pretrained_path="<hf_user>/<reward_model_repo_id>")
reward = reward_model.compute_reward(batch)
```
---
## Model Details
- **License:** {{ license | default("\[More Information Needed]", true) }}
+26 -6
View File
@@ -86,11 +86,24 @@ def hw_to_dataset_features(
}
for key, shape in cam_fts.items():
features[f"{prefix}.images.{key}"] = {
"dtype": "video" if use_video else "image",
"shape": shape,
"names": ["height", "width", "channels"],
}
if len(shape) == 2:
# Single-channel feature (e.g. depth map). The hardware-side key is
# expected to use a "_depth" suffix to disambiguate from its color
# counterpart; we strip it so the dataset feature is published as
# ``{prefix}.depth.<bare>`` and aligned with ``observation.images.<bare>``.
bare = key.removesuffix("_depth") if key.endswith("_depth") else key
features[f"{prefix}.depth.{bare}"] = {
"dtype": "video" if use_video else "image",
"shape": shape,
"names": ["height", "width"],
"info": {"video.is_depth_map": True},
}
else:
features[f"{prefix}.images.{key}"] = {
"dtype": "video" if use_video else "image",
"shape": shape,
"names": ["height", "width", "channels"],
}
_validate_feature_names(features)
return features
@@ -120,7 +133,14 @@ def build_dataset_frame(
elif ft["dtype"] == "float32" and len(ft["shape"]) == 1:
frame[key] = np.array([values[name] for name in ft["names"]], dtype=np.float32)
elif ft["dtype"] in ["image", "video"]:
frame[key] = values[key.removeprefix(f"{prefix}.images.")]
if key.startswith(f"{prefix}.depth."):
bare = key.removeprefix(f"{prefix}.depth.")
# Hardware emits depth values under "<bare>_depth" to disambiguate
# from the color stream stored at "<bare>" — fall back to the bare
# name when the producer already uses dataset-style keys.
frame[key] = values.get(f"{bare}_depth", values.get(bare))
else:
frame[key] = values[key.removeprefix(f"{prefix}.images.")]
return frame
+4 -2
View File
@@ -69,7 +69,7 @@ def is_package_available(
return package_exists
def get_safe_default_codec():
def get_safe_default_video_backend():
logger = logging.getLogger(__name__)
if importlib.util.find_spec("torchcodec"):
return "torchcodec"
@@ -115,7 +115,9 @@ _feetech_sdk_available = is_package_available("feetech-servo-sdk", import_name="
_reachy2_sdk_available = is_package_available("reachy2_sdk")
_can_available = is_package_available("python-can", "can")
_unitree_sdk_available = is_package_available("unitree-sdk2py", "unitree_sdk2py")
_pyrealsense2_available = is_package_available("pyrealsense2")
_pyrealsense2_available = is_package_available("pyrealsense2") or is_package_available(
"pyrealsense2-macosx", import_name="pyrealsense2"
)
_zmq_available = is_package_available("pyzmq", import_name="zmq")
_hebi_available = is_package_available("hebi-py", import_name="hebi")
_teleop_available = is_package_available("teleop")
+239
View File
@@ -0,0 +1,239 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Sample weighting abstraction for training.
This module provides an abstract base class for sample weighting strategies (e.g., RA-BC)
that can be used during training without polluting the training script with
policy-specific code.
Example usage:
# In training config
sample_weighting:
type: rabc
progress_path: hf://datasets/my-dataset/sarm_progress.parquet
head_mode: sparse
kappa: 0.01
# In training script
sample_weighter = make_sample_weighter(cfg.sample_weighting, policy, device, dataset_root=cfg.dataset.root, dataset_repo_id=cfg.dataset.repo_id)
...
weights, stats = sample_weighter.compute_batch_weights(batch)
"""
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import TYPE_CHECKING
import torch
if TYPE_CHECKING:
from lerobot.policies.pretrained import PreTrainedPolicy
class SampleWeighter(ABC):
"""
Implementations compute per-sample weights that can be used to weight
the loss during training. This enables techniques like:
- RA-BC (Reward-Aligned Behavior Cloning)
- Importance sampling
- Curriculum learning
- Quality-based filtering
"""
@abstractmethod
def compute_batch_weights(self, batch: dict) -> tuple[torch.Tensor, dict]:
"""
Compute per-sample weights for a training batch.
Args:
batch: Training batch dictionary containing at minimum an "index" key
with global frame indices.
"""
@abstractmethod
def get_stats(self) -> dict:
"""
Get global statistics about the weighting strategy.
"""
@dataclass
class SampleWeightingConfig:
"""
Configuration for sample weighting during training.
This is a generic config that supports multiple weighting strategies.
The `type` field determines which implementation to use, and `extra_params`
contains additional type-specific parameters.
Attributes:
type: Weighting strategy type ("rabc", "uniform", etc.)
progress_path: Path to precomputed progress values (for RABC)
head_mode: Which model head to use for progress ("sparse" or "dense")
kappa: Hard threshold for high-quality samples (RABC-specific)
epsilon: Small constant for numerical stability
extra_params: Additional type-specific parameters passed to the weighter
"""
type: str = "rabc"
progress_path: str | None = None
head_mode: str = "sparse"
kappa: float = 0.01
epsilon: float = 1e-6
# Additional type-specific params can be added here or passed via extra_params
extra_params: dict = field(default_factory=dict)
def make_sample_weighter(
config: SampleWeightingConfig | None,
policy: PreTrainedPolicy,
device: torch.device,
dataset_root: str | None = None,
dataset_repo_id: str | None = None,
) -> SampleWeighter | None:
"""
Factory function to create a SampleWeighter from config.
This keeps policy-specific initialization logic out of the training script.
Args:
config: Sample weighting configuration, or None to disable weighting.
policy: The policy being trained (used to extract chunk_size, etc.)
device: Device to place weight tensors on.
dataset_root: Local path to dataset root (for auto-detecting progress_path).
dataset_repo_id: HuggingFace repo ID (for auto-detecting progress_path).
"""
if config is None:
return None
if config.type == "rabc":
return _make_rabc_weighter(config, policy, device, dataset_root, dataset_repo_id)
if config.type == "uniform":
# No-op weighter that returns uniform weights
return UniformWeighter(device=device)
raise ValueError(f"Unknown sample weighting type: '{config.type}'. Supported types: 'rabc', 'uniform'")
def _make_rabc_weighter(
config: SampleWeightingConfig,
policy: PreTrainedPolicy,
device: torch.device,
dataset_root: str | None = None,
dataset_repo_id: str | None = None,
) -> SampleWeighter:
"""Create RABC weighter with policy-specific initialization.
Args:
config: Sample weighting configuration.
policy: The policy being trained (used to extract chunk_size).
device: Device to place weight tensors on.
dataset_root: Local path to dataset root (for auto-detecting progress_path).
dataset_repo_id: HuggingFace repo ID (for auto-detecting progress_path).
"""
# Import here to avoid circular imports and keep RABC code in SARM module
from lerobot.rewards.sarm.rabc import RABCWeights
# Extract chunk_size from policy config
chunk_size = getattr(policy.config, "chunk_size", None)
if chunk_size is None:
raise ValueError(
"RABC sample weighting requires a policy with 'chunk_size' in its config. "
"This is typically set for action-chunking policies like ACT, Diffusion, PI0, etc."
)
# Determine progress_path: use explicit config or auto-detect from dataset
progress_path = config.progress_path
if progress_path is None:
if dataset_root:
progress_path = str(Path(dataset_root) / "sarm_progress.parquet")
elif dataset_repo_id:
progress_path = f"hf://datasets/{dataset_repo_id}/sarm_progress.parquet"
else:
raise ValueError(
"RABC sample weighting requires 'progress_path' to be set, "
"or dataset_root/dataset_repo_id for auto-detection. "
"Generate progress values using: "
"python -m lerobot.rewards.sarm.compute_rabc_weights --help"
)
return RABCWeights(
progress_path=progress_path,
chunk_size=chunk_size,
head_mode=config.head_mode,
kappa=config.kappa,
epsilon=config.epsilon,
device=device,
**config.extra_params,
)
class UniformWeighter(SampleWeighter):
"""
No-op sample weighter that returns uniform weights.
Useful as a baseline or when you want to disable weighting without
changing the training code structure.
Note:
Batch size is determined by looking for tensor values in the batch
dictionary. The method checks common keys like "action", "index",
and "observation.state" first, then falls back to scanning all values.
"""
def __init__(self, device: torch.device):
self.device = device
def compute_batch_weights(self, batch: dict) -> tuple[torch.Tensor, dict]:
"""Return uniform weights (all ones)."""
batch_size = self._determine_batch_size(batch)
weights = torch.ones(batch_size, device=self.device)
stats = {"mean_weight": 1.0, "type": "uniform"}
return weights, stats
def _determine_batch_size(self, batch: dict) -> int:
"""
Determine batch size from the batch dictionary.
Checks common keys first, then scans all values for tensors.
Args:
batch: Training batch dictionary.
"""
if not batch:
raise ValueError("Cannot determine batch size from empty batch")
# Check common keys first
for key in ["action", "index", "observation.state"]:
if key in batch and isinstance(batch[key], torch.Tensor):
return batch[key].shape[0]
# Scan all values for any tensor
for value in batch.values():
if isinstance(value, torch.Tensor) and value.ndim >= 1:
return value.shape[0]
# Last resort: return 1 (this handles non-tensor batches)
return 1
def get_stats(self) -> dict:
"""Return empty stats for uniform weighting."""
return {"type": "uniform"}
+72
View File
@@ -63,10 +63,56 @@ def _is_scalar(x):
)
def _derive_depth_obs_ranges(
features: dict[str, dict] | None,
) -> dict[str, tuple[float, float] | None]:
"""Map observation keys of depth features to their ``(depth_min, depth_max)`` range.
A feature is considered a depth map when its ``info`` dict carries
``video.is_depth_map=True`` (the marker set by ``hw_to_dataset_features``
and persisted in ``info.json``). For each such feature, we record both
the fully-namespaced dataset key (e.g. ``observation.depth.front``) and
the corresponding raw observation key forms the robot is likely to emit
(``front`` and ``front_depth``) so a single membership check covers all
call sites.
The mapped value is the ``(depth_min, depth_max)`` range stored on the
feature (matching the quantization range used at encoding time), or
``None`` when the metadata doesn't expose a range — in which case the
caller should let Rerun auto-normalize. Anchoring the colormap to a
fixed range avoids per-frame re-normalization, which otherwise looks
like flicker on near-static scenes.
"""
ranges: dict[str, tuple[float, float] | None] = {}
if not features:
return ranges
depth_prefix = f"{OBS_STR}.depth."
for fk, fv in features.items():
info = fv.get("info") if isinstance(fv, dict) else None
if not isinstance(info, dict) or not info.get("video.is_depth_map", False):
continue
depth_min = info.get("video.depth_min")
depth_max = info.get("video.depth_max")
rng: tuple[float, float] | None = None
if (
isinstance(depth_min, (int, float))
and isinstance(depth_max, (int, float))
and depth_max > depth_min
):
rng = (float(depth_min), float(depth_max))
ranges[fk] = rng
if fk.startswith(depth_prefix):
bare = fk[len(depth_prefix) :]
ranges[bare] = rng
ranges[f"{bare}_depth"] = rng
return ranges
def log_rerun_data(
observation: RobotObservation | None = None,
action: RobotAction | None = None,
compress_images: bool = False,
features: dict[str, dict] | None = None,
) -> None:
"""
Logs observation and action data to Rerun for real-time visualization.
@@ -76,6 +122,13 @@ def log_rerun_data(
- Scalars values (floats, ints) are logged as `rr.Scalars`.
- 3D NumPy arrays that resemble images (e.g., with 1, 3, or 4 channels first) are transposed
from CHW to HWC format, (optionally) compressed to JPEG and logged as `rr.Image` or `rr.EncodedImage`.
- 2D NumPy arrays whose key matches a depth feature in ``features`` (i.e. carrying
``video.is_depth_map=True``) are logged as ``rr.DepthImage`` with the Viridis
colormap and ``meter=1.0`` (depth values are expected in metric meters). When
the feature exposes ``video.depth_min`` / ``video.depth_max`` (the encoder
quantization range, persisted in ``info.json``), the colormap is anchored to
that range via ``depth_range`` to keep the visualization stable across frames.
Depth images are never JPEG-compressed regardless of ``compress_images``.
- 1D NumPy arrays are logged as a series of individual scalars, with each element indexed.
- Other multi-dimensional arrays are flattened and logged as individual scalars.
@@ -85,11 +138,16 @@ def log_rerun_data(
observation: An optional dictionary containing observation data to log.
action: An optional dictionary containing action data to log.
compress_images: Whether to compress images before logging to save bandwidth & memory in exchange for cpu and quality.
features: Optional dataset feature spec (e.g. ``LeRobotDataset.features``). When
provided, observation entries matching a depth-map feature are rendered with
``rr.DepthImage`` instead of the generic ``rr.Image`` path.
"""
require_package("rerun-sdk", extra="viz", import_name="rerun")
import rerun as rr
depth_obs_ranges = _derive_depth_obs_ranges(features)
if observation:
for k, v in observation.items():
if v is None:
@@ -100,6 +158,20 @@ def log_rerun_data(
rr.log(key, rr.Scalars(float(v)))
elif isinstance(v, np.ndarray):
arr = v
is_depth = bool(depth_obs_ranges) and (k in depth_obs_ranges or key in depth_obs_ranges)
if is_depth and arr.ndim == 2:
# Viridis-colormapped DepthImage; never JPEG-compress (lossy on float metric depth).
# Anchor the colormap to the encoder range when available, so the
# visualization doesn't flicker as per-frame min/max drift.
depth_range = depth_obs_ranges.get(k) or depth_obs_ranges.get(key)
depth_kwargs: dict = {
"meter": 1.0,
"colormap": rr.components.Colormap.Viridis,
}
if depth_range is not None:
depth_kwargs["depth_range"] = depth_range
rr.log(key, rr.DepthImage(arr, **depth_kwargs), static=True)
continue
# Convert CHW -> HWC when needed
if arr.ndim == 3 and arr.shape[0] in (1, 3, 4) and arr.shape[-1] not in (1, 3, 4):
arr = np.transpose(arr, (1, 2, 0))
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c2b8f8532c7a0b776de5e536b8b54e30b1a0c2e3d5cc25a2d86fe43e40ae5e8c
oid sha256:8a31653c11eccdd4d80fd3f6a351cd54c49b8a48db1f7e9faf38fddd7900a09f
size 515400
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:224b5fa4828aa88171b68c036e8919c1eae563e2113f03b6461eadf5bf8525a6
oid sha256:75bf051698b37dcd7517ec8025a896ab5a0551a6dde5f89d0a3d5d50966e83e6
size 31672
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:016d2fa8fe5f58017dfd46f4632fdc19dfd751e32a2c7cde2077c6f95546d6bd
oid sha256:88e10930a10041d50f2cf369e6813ac14618d13dad1c21bdde1ac7798611c6ba
size 68
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eca0d87a699620e4fec7e68539b0be91e4cc933f6bf12032da52c182ab6f38cf
oid sha256:89833a5ccdb7d85c83f717ff8ec68b8e822005cb8803899acaae88c578e2e3ae
size 31672
+25
View File
@@ -202,6 +202,31 @@ def test_read_latest_too_old():
_ = camera.read_latest(max_age_ms=0) # immediately too old
def test_async_read_depth_without_use_depth_raises():
"""``async_read_depth`` must reject cameras configured without ``use_depth=True``."""
config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
with RealSenseCamera(config) as camera, pytest.raises(RuntimeError, match="use_depth=False"):
_ = camera.async_read_depth()
def test_read_latest_depth_without_use_depth_raises():
"""``read_latest_depth`` must reject cameras configured without ``use_depth=True``."""
config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
with RealSenseCamera(config) as camera, pytest.raises(RuntimeError, match="use_depth=False"):
_ = camera.read_latest_depth()
def test_depth_to_meters_uses_depth_scale():
"""``_depth_to_meters`` must scale uint16 raw depth into float32 metric meters."""
config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
camera = RealSenseCamera(config)
camera.depth_scale = 0.001 # typical D-series scale (1 mm/unit)
raw = np.array([[0, 1000, 2500], [4095, 65535, 0]], dtype=np.uint16)
meters = camera._depth_to_meters(raw)
assert meters.dtype == np.float32
np.testing.assert_allclose(meters, raw.astype(np.float32) * 0.001)
@pytest.mark.parametrize(
"rotation",
[
+3 -3
View File
@@ -113,7 +113,7 @@ def assert_metadata_consistency(aggr_ds, ds_0, ds_1):
"""Test that metadata is correctly aggregated."""
# Test basic info
assert aggr_ds.fps == ds_0.fps == ds_1.fps, "FPS should be the same across all datasets"
assert aggr_ds.meta.info["robot_type"] == ds_0.meta.info["robot_type"] == ds_1.meta.info["robot_type"], (
assert aggr_ds.meta.info.robot_type == ds_0.meta.info.robot_type == ds_1.meta.info.robot_type, (
"Robot type should be the same"
)
@@ -153,8 +153,8 @@ def assert_video_frames_integrity(aggr_ds, ds_0, ds_1):
video_keys = list(
filter(
lambda key: aggr_ds.meta.info["features"][key]["dtype"] == "video",
aggr_ds.meta.info["features"].keys(),
lambda key: aggr_ds.meta.info.features[key]["dtype"] == "video",
aggr_ds.meta.info.features.keys(),
)
)
+31 -1
View File
@@ -142,6 +142,36 @@ def test_create_without_videos_has_no_video_path(tmp_path):
assert meta.video_keys == []
def test_depth_keys_property_filters_by_marker(tmp_path):
"""``depth_keys`` selects only video features carrying ``video.is_depth_map=True``."""
features = {
**SIMPLE_FEATURES,
"observation.images.cam": {
"dtype": "video",
"shape": (64, 96, 3),
"names": ["height", "width", "channels"],
"info": None,
},
"observation.depth.cam": {
"dtype": "video",
"shape": (64, 96),
"names": ["height", "width"],
"info": {"video.is_depth_map": True},
},
}
meta = LeRobotDatasetMetadata.create(
repo_id="test/depth_keys", fps=DEFAULT_FPS, features=features, root=tmp_path / "depth_keys"
)
assert set(meta.video_keys) == {"observation.images.cam", "observation.depth.cam"}
assert meta.depth_keys == ["observation.depth.cam"]
def test_depth_keys_empty_when_no_marker(tmp_path):
meta = LeRobotDatasetMetadata.create(
repo_id="test/no_depth", fps=DEFAULT_FPS, features=VIDEO_FEATURES, root=tmp_path / "no_depth"
)
assert meta.depth_keys == []
def test_create_raises_on_existing_directory(tmp_path):
"""create() raises if root directory already exists."""
root = tmp_path / "existing"
@@ -161,7 +191,7 @@ def test_init_loads_existing_metadata(tmp_path, lerobot_dataset_metadata_factory
assert meta.total_episodes == 3
assert meta.total_frames == 150
assert meta.fps == info["fps"]
assert meta.fps == info.fps
# ── Property accessors ───────────────────────────────────────────────

Some files were not shown because too many files have changed in this diff Show More