mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-28 06:59:44 +00:00
2686450d68
Subtask prompt (``module_1_subtasks.txt``):

- Lock the verb vocabulary to composite atomic actions (``pick up``,
  ``put``/``place``, ``push``/``pull``, ``turn``, ``press``,
  ``open``/``close``, ``pour``, ``insert``, ``go to``).
- Add an explicit ``Forbidden ultra-fine splits`` block instructing the
  VLM to fold ``move to X`` / ``reach for X`` / ``grasp X`` / ``lift X`` /
  ``release X`` into the parent composite. Previous examples actively
  encouraged the over-segmentation pattern.
- Rewrite the Good/Bad examples around the composite contract.

Job config (``examples/annotations/run_hf_job.py``):

- Point at ``pepijn223/robocasa_smoke_2atomic_v3`` on ``h200x4``.
- ``--vlm.camera_key=robot0_agentview_left`` (real key for the dataset;
  the prior ``observation.images.wrist`` did not exist and would have
  silenced the VQA module).
- ``--vlm.serve_command`` ``--max-model-len 131072`` (4x): keeps 90 s @
  1 Hz episode video blocks under context even at full Qwen vision
  resolution. On 1x H200 (144 GB) the 35B-FP8 model has plenty of room
  for the bigger KV cache.
- ``--vocabulary.enabled=false``: heterogeneous dataset, no benefit from
  a single canonical vocabulary.
- ``--plan.derive_task_from_video=off``, ``--plan.n_task_rephrasings=0``:
  reuse the dataset's own ``episode_task`` strings as-is.
- ``--plan.min_subtask_seconds=3.0``, ``--plan.plan_max_steps=6``: give
  the new composite-action rules room to land (1.5 s floor was too small
  to host a full grasp-or-place composite).
- ``--vqa.vqa_emission_hz=3.0``: denser VQA grounding.
- Timeout 24h, episode_parallelism=64, client_concurrency=256 to scale
  to the 25k-trajectory regime when the same recipe is pointed at a
  larger dataset.

Co-authored-by: Cursor <cursoragent@cursor.com>
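
Illustrative sketch (not the contents of ``run_hf_job.py``): the overrides listed above can be collected in one place and rendered as ``--key=value`` strings. The helper names ``build_flags`` and ``tokens_per_frame`` are hypothetical; only the flag names and values are taken from this message. The second helper is a rough upper bound on the context available per sampled frame, assuming one frame per sample and ignoring prompt and output tokens.

```python
import shlex

# Overrides copied verbatim from the commit message; nothing else is assumed
# about the real script's structure.
OVERRIDES = {
    "vlm.camera_key": "robot0_agentview_left",
    "vocabulary.enabled": "false",
    "plan.derive_task_from_video": "off",
    "plan.n_task_rephrasings": "0",
    "plan.min_subtask_seconds": "3.0",
    "plan.plan_max_steps": "6",
    "vqa.vqa_emission_hz": "3.0",
}

# Value passed to the serve command via --max-model-len.
MAX_MODEL_LEN = 131072


def build_flags(overrides):
    """Render {key: value} as ``--key=value`` strings, preserving order."""
    return [f"--{key}={value}" for key, value in overrides.items()]


def tokens_per_frame(max_model_len, seconds, hz):
    """Upper bound on context tokens per sampled frame.

    Ignores prompt and output tokens, so the usable budget is smaller.
    """
    frames = int(seconds * hz)
    return max_model_len // frames


flags = build_flags(OVERRIDES)
print(shlex.join(flags))

# 90 s at 1 Hz is 90 frames sharing the 131072-token window.
print(tokens_per_frame(MAX_MODEL_LEN, 90, 1.0))  # 1456
```

Keeping the overrides in a single mapping makes it easy to diff one job recipe against the next when the same script is pointed at a larger dataset.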