lerobot

admin/lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-28 15:09:51 +00:00

Commit Graph

Author	SHA1	Message	Date

pepijn

3fdfcb912a

examples(port_datasets): generalize RoboCasa builder + add smoke script

- Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys
  (atomic, composite_unseen, composite_all, composite_atomic) so the same
  builder produces the 50-task target benchmark or the 300-task Human300
  pretraining slice (via --split=pretrain --task-set=all) without
  duplicating logic.
- Stop hardcoding the composite_seen tag on the HF push; tags are now
  derived from --split / --source / --task-set so atomic, composite_all,
  and pretrain runs land with accurate metadata.
- Refresh module docstring to match the broader scope.
- Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset
  (~1k episodes, ~131k frames) for fast end-to-end training validation
  before kicking off Human300-scale runs.
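
As a rough illustration of deriving Hub tags from the CLI values rather than hardcoding composite_seen, a minimal sketch might look like the following. The function name, the tag format, and the example `human` source value are assumptions made for illustration; the commit does not show the builder's actual code.

```python
# Hypothetical sketch: function name and tag format are illustrative,
# not taken from the actual builder.
def derive_tags(split: str, source: str, task_set: str) -> list[str]:
    """Build Hub tags from the CLI values instead of a fixed composite_seen tag."""
    return [
        "robocasa",
        f"split:{split}",
        f"source:{source}",
        f"task-set:{task_set}",
    ]


if __name__ == "__main__":
    # A pretraining run and an atomic run get different, accurate tags.
    print(derive_tags("pretrain", "human", "all"))
    print(derive_tags("target", "human", "atomic"))
```

With this shape, adding a new --task-set key needs no change to the tagging code.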

2026-05-25 14:54:00 +00:00

Pepijn

67bdf4690e

examples(port_datasets): rewrite RoboCasa composite_seen builder

Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:

* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
  flattens observation.state by concatenating base/EE/gripper subkeys
  when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset
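
The key standardization and state flattening steps above could be sketched roughly as follows. This is plain Python under stated assumptions: the sub-key names (`base`, `eef`, `gripper`), their order, and both helper names are hypothetical, and the real builder presumably operates on arrays rather than lists.

```python
# Assumed sub-key names and concatenation order; the real builder may differ.
STATE_SUBKEYS = ("base", "eef", "gripper")
IMAGE_PREFIX = "observation.images."


def flatten_state(state):
    """Return one flat vector, concatenating sub-keys when state is a dict."""
    if isinstance(state, dict):
        flat = []
        for key in STATE_SUBKEYS:
            flat.extend(state[key])
        return flat
    # Already a flat sequence: copy it unchanged.
    return list(state)


def standardize_camera_key(name):
    """Map a camera name onto the observation.images.robot0_* namespace."""
    short = name[len(IMAGE_PREFIX):] if name.startswith(IMAGE_PREFIX) else name
    if not short.startswith("robot0_"):
        short = "robot0_" + short
    return IMAGE_PREFIX + short


if __name__ == "__main__":
    print(flatten_state({"base": [0.0, 1.0], "eef": [2.0], "gripper": [3.0]}))
    print(standardize_camera_key("agentview_left"))
```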

Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.
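
The precedence described above (an explicit --tasks list wins, then --task-set=all, then the composite_seen filter) might be sketched like this. The `select_tasks` helper is hypothetical, and the task tuple holds only the three names quoted in this message, not the full list of 16.

```python
# Only the three names quoted in the commit message; the real list has 16 entries.
COMPOSITE_SEEN_TASKS = ("DeliverStraw", "GetToastedBread", "WashLettuce")


def select_tasks(discovered, task_set="composite_seen", tasks=None):
    """Filter discovered task names; explicit `tasks` overrides `task_set`."""
    if tasks:
        wanted = set(tasks)
    elif task_set == "all":
        return list(discovered)
    elif task_set == "composite_seen":
        wanted = set(COMPOSITE_SEEN_TASKS)
    else:
        raise ValueError(f"unknown task set: {task_set}")
    # Preserve discovery order while dropping everything not requested.
    return [name for name in discovered if name in wanted]


if __name__ == "__main__":
    found = ["SomeOtherTask", "WashLettuce", "DeliverStraw"]
    print(select_tasks(found))
    print(select_tasks(found, task_set="all"))
    print(select_tasks(found, tasks=["SomeOtherTask"]))
```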

Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.

Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 14:27:42 +02:00

Pepijn

a088c10c80

examples(port_datasets): SLURM+datatrove RoboCasa composite_seen build

Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.

Two-phase datatrove pipeline:
  * Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
    each worker downloads its assigned tar via RoboCasa's own
    download_datasets helper. Network-bound, idempotent.
  * Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
    over the 16 extracted directories. Submitted with depends=phase1 so
    SLURM only releases it once all 16 downloads succeed.
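
The dependency between the two phases can be illustrated without datatrove or SLURM. The sketch below is a stand-in built on Python's standard library thread pool, not the script's implementation: `run_two_phase` and its callback parameters are hypothetical names, and the only property it demonstrates is that aggregation runs once and only after every download has succeeded.

```python
from concurrent.futures import ThreadPoolExecutor


def run_two_phase(tasks, download, aggregate, max_workers=16):
    """Phase 1: one download per task, in parallel.
    Phase 2: a single aggregate call over all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, task) for task in tasks]
        # result() re-raises a worker's exception, so any failed download
        # aborts here and phase 2 is never reached.
        paths = [future.result() for future in futures]
    return aggregate(paths)


if __name__ == "__main__":
    def fake_download(task):
        # Placeholder for fetching and extracting one task archive.
        return f"/tmp/robocasa/{task}"

    def fake_aggregate(paths):
        # Placeholder for merging the extracted directories.
        return sorted(paths)

    print(run_two_phase(["DeliverStraw", "WashLettuce"], fake_download, fake_aggregate))
```

Because each download is idempotent, a failed run can simply be resubmitted; completed tasks are cheap to repeat.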

Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports — single
source of truth for 'what does it mean to download a composite_seen
task'.

Local (--slurm 0) mode runs the two phases sequentially in-process for
debugging on a workstation.

Usage on SLURM:
    uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
        --output-dir=/scratch/${USER}/robocasa_composite_seen \
        --hub-repo-id=${HF_USER}/robocasa_composite_seen \
        --logs-dir=/scratch/${USER}/logs/robocasa \
        --partition=cpu --push-to-hub

Prereq: uv sync --extra annotations  (pulls datatrove)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 14:10:05 +02:00

3 Commits