Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:
* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
  flattens observation.state by concatenating base/EE/gripper subkeys
  when the source dataset stores them separately
* builds per-rank unified shards, then aggregates them into one dataset
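
A minimal sketch of the key standardization and state flattening step,
for illustration only. The source-side key names, the subkey order
(base, ee, gripper) and the helper names are assumptions, not taken from
the actual script; states are plain lists here to keep the example
dependency-free.

```python
# Hypothetical key layout; the real RoboCasa/LeRobot keys may differ.
CAMERA_PREFIX = "observation.images."
TARGET_CAMERA_PREFIX = "observation.images.robot0_"
STATE_SUBKEYS = ("base", "ee", "gripper")  # assumed concatenation order


def standardize_camera_key(key: str) -> str:
    """Move a camera key under observation.images.robot0_* if needed."""
    if not key.startswith(CAMERA_PREFIX) or key.startswith(TARGET_CAMERA_PREFIX):
        return key
    return TARGET_CAMERA_PREFIX + key[len(CAMERA_PREFIX):]


def standardize_frame(frame: dict) -> dict:
    """Rename camera keys and flatten split state subkeys into one vector."""
    out = {standardize_camera_key(k): v for k, v in frame.items()}
    parts = [f"observation.state.{name}" for name in STATE_SUBKEYS]
    # Already flat, or the split subkeys are not all present: leave as-is.
    if "observation.state" in out or not all(p in out for p in parts):
        return out
    state = []
    for p in parts:
        state.extend(out.pop(p))
    out["observation.state"] = state
    return out
```

Frames that already carry a flat observation.state pass through
unchanged, which matches the "when the source dataset stores them
separately" condition above.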
Filter: the composite_seen task set restricts discovery to the 16
multi-step target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce).
Use --task-set=all to keep every discovered task in the split/source
slice; --tasks=... overrides the task set to select an arbitrary subset.
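
The filter precedence described above could look roughly like this. It is
a sketch under assumptions: select_tasks is a hypothetical helper, the
tuple holds only the three task names quoted in this message (the real
list has 16), and restricting --tasks to discovered names is a guess at
the intended behavior.

```python
from typing import Iterable, List, Optional

# Placeholder: only the names quoted above; the elided ones are not filled in.
COMPOSITE_SEEN_QUOTED = ("DeliverStraw", "GetToastedBread", "WashLettuce")


def select_tasks(
    discovered: Iterable[str],
    task_set: str = "composite_seen",
    tasks: Optional[Iterable[str]] = None,
) -> List[str]:
    """Pick which discovered tasks to process; an explicit tasks list wins."""
    discovered = list(discovered)
    if tasks:  # --tasks=... overrides the task set
        wanted = set(tasks)
    elif task_set == "all":
        return sorted(discovered)
    elif task_set == "composite_seen":
        wanted = set(COMPOSITE_SEEN_QUOTED)
    else:
        raise ValueError(f"unknown task set: {task_set!r}")
    return sorted(t for t in discovered if t in wanted)
```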
Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.
Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.
Two-phase datatrove pipeline:
* Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
  each worker downloads its assigned tar via RoboCasa's own
  download_datasets helper. Network-bound, idempotent.
* Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
  over the 16 extracted directories. Submitted with depends=phase1 so
  SLURM only releases it once all 16 downloads succeed.
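
A rough sketch of how a Phase 1 worker could map its rank to a task and
stay idempotent on retry. The marker-file convention, the function name
and the injected download callable are illustrative assumptions; the
actual script delegates the download itself to RoboCasa's helper.

```python
from pathlib import Path
from typing import Callable, List, Sequence


def run_download_rank(
    rank: int,
    world_size: int,
    tasks: Sequence[str],
    output_dir: str,
    download: Callable[[str, Path], None],
) -> List[str]:
    """Download the task(s) assigned to this rank, skipping finished ones."""
    fetched = []
    # With world_size == len(tasks), each rank gets exactly one task.
    for task in tasks[rank::world_size]:
        task_dir = Path(output_dir) / task
        marker = task_dir / ".download_complete"  # hypothetical marker file
        if marker.exists():
            continue  # a previous run already finished this task
        task_dir.mkdir(parents=True, exist_ok=True)
        download(task, task_dir)
        marker.touch()  # written only after a successful download
        fetched.append(task)
    return fetched
```

Writing the marker last means a worker that dies mid-download redoes the
task on resubmission instead of leaving a partial directory that looks
complete.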
Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports, so there is
a single source of truth for what it means to download a composite_seen
task.
Local mode (--slurm 0) runs the two phases sequentially in-process for
debugging on a workstation.
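
In local mode the phase ordering reduces to something like the sketch
below. run_local and both callables are hypothetical stand-ins for the
script's real phase functions; the point is only that aggregation never
starts if any download raises, mirroring the SLURM dependency.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_local(
    tasks: Sequence[str],
    download_one: Callable[[int, str], str],
    aggregate: Callable[[List[str]], T],
) -> T:
    """Run phase 1 for every rank in order, then phase 2 once."""
    extracted = []
    for rank, task in enumerate(tasks):
        # Any exception here propagates and skips the aggregate phase.
        extracted.append(download_one(rank, task))
    return aggregate(extracted)
```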
Usage on SLURM:
uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
--output-dir=/scratch/${USER}/robocasa_composite_seen \
--hub-repo-id=${HF_USER}/robocasa_composite_seen \
--logs-dir=/scratch/${USER}/logs/robocasa \
--partition=cpu --push-to-hub
Prereq: uv sync --extra annotations (pulls datatrove)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>