- Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys
(atomic, composite_unseen, composite_all, composite_atomic) so the same
builder produces the 50-task target benchmark or the 300-task Human300
pretraining slice (via --split=pretrain --task-set=all) without
duplicating logic.
- Stop hardcoding the composite_seen tag on the HF push; tags are now
derived from --split / --source / --task-set so atomic, composite_all,
and pretrain runs land with accurate metadata.
- Refresh module docstring to match the broader scope.
- Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset
(~1k episodes, ~131k frames) for fast end-to-end training validation
before kicking off Human300-scale runs.
Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:
* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
flattens observation.state by concatenating base/EE/gripper subkeys
when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset
Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.
Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.
Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.
Two-phase datatrove pipeline:
* Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
each worker downloads its assigned tar via RoboCasa's own
download_datasets helper. Network-bound, idempotent.
* Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
over the 16 extracted directories. Submitted with depends=phase1 so
SLURM only releases it once all 16 downloads succeed.
Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports — single
source of truth for 'what does it mean to download a composite_seen
task'.
Local (--slurm 0) mode runs the two phases sequentially in-process for
debugging on a workstation.
Usage on SLURM:
uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
--output-dir=/scratch/${USER}/robocasa_composite_seen \
--hub-repo-id=${HF_USER}/robocasa_composite_seen \
--logs-dir=/scratch/${USER}/logs/robocasa \
--partition=cpu --push-to-hub
Prereq: uv sync --extra annotations (pulls datatrove)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>