Files
Pepijn 82dffde7fa fix(ci): speed up multi-task benchmark evals (parallelize + cap VLABench steps) (#3529)
* fix(ci): run multi-task benchmark evals 5-at-a-time in parallel

The eval script supports running tasks concurrently via a
ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four
multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus
— 8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of
running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra)
are unchanged.

* fix(ci): cap VLABench smoke eval at 50 steps per task

VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s
the smoke eval took ~80 minutes of rollouts on top of the image build.
The eval is a pipeline smoke test (running_success_rate stays at 0% on
this short rollout anyway), so we don't need full episodes — cap each
task at 50 steps to bring total rollout time down ~10x.

* fix(ci): run VLABench tasks 5-at-a-time in parallel

The eval script already supports running multiple tasks concurrently via
a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10
VLABench tasks finish in ~2 waves instead of running sequentially.
2026-05-07 13:37:16 +02:00
..