examples(port_datasets): rewrite RoboCasa composite_seen builder

Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:

* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
  flattens observation.state by concatenating base/EE/gripper subkeys
  when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset

Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.

Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.

Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pepijn
2026-05-25 14:27:42 +02:00
parent 8085feab6e
commit 67bdf4690e
File diff suppressed because it is too large Load Diff