mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-17 16:27:04 +00:00
annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup
Bugs
* validator: don't re-raise on unknown style. The second column_for_style
lookup (used to route persistent vs event) now sits in try/except so an
unknown style is recorded by _check_column_routing and skipped instead
of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set but
the configured camera_key isn't one the provider exposes, warn and fall
back to all cameras instead of returning a phantom key that KeyErrors
deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
rather than a hardcoded 0.0 (datasets can start at non-zero t).
Docs / code drift
* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
exists); describe the real describe->segment + coverage-stitch flow.
Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
this PR' (matches tools.mdx, which already marks the runtime layer as
not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
the default is now a single h200. Add a 'Contributing new modules'
section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.
run_hf_job.py
* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
(scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
after merge).
Naming / cleanup
* rename dest_repo_id -> new_repo_id across config / script / example /
test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
'# nosec B607' and drop it from the global skip list.
Tests
* fix stale canned-VLM markers ('ONE realistic interruption' ->
'compact interjection', 'Update the memory' -> 'compressed semantic
memory') and drop the dead 'concise hierarchical PLAN' plan responders
(plan generation is deterministic now) in run_e2e_smoke,
test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -7,8 +7,7 @@
|
||||
|
||||
## What the pipeline produces
|
||||
|
||||
A vocabulary-discovery phase derives a small canonical wording, then three
|
||||
modules write into a per-episode staging tree, then a single writer
|
||||
Three modules write into a per-episode staging tree, then a single writer
|
||||
rewrites the data shards in place:
|
||||
|
||||
| Style / atom | Column | Module |
|
||||
@@ -21,20 +20,15 @@ rewrites the data shards in place:
|
||||
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections` |
|
||||
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
|
||||
|
||||
The `plan` module is constrained to a **canonical vocabulary** discovered
|
||||
once per dataset by the `vocabulary` module (phase 0). It watches a few
|
||||
sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
|
||||
asks the VLM to derive a small set of imperative subtask labels and
|
||||
first-person memory milestones that recur across the demos. The VLM
|
||||
picks the right number of entries itself based on what it sees in the
|
||||
clips — short pick-and-place demos get ~6 subtask labels, longer
|
||||
multi-step recipes get more. The result lands at
|
||||
`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
|
||||
is reused on every subsequent run. The `plan` module then constrains
|
||||
both subtask + memory generation to those exact strings — the
|
||||
downstream low-level policy sees a small, repeatable target
|
||||
distribution instead of thousands of LLM paraphrases. Disable with
|
||||
`--vocabulary.enabled=False` to fall back to free-form generation.
|
||||
The `plan` module generates subtasks per episode with a **describe → segment**
|
||||
grounding flow: a first pass narrates only what is visible in the chosen
|
||||
camera, and its description is fed into a second pass that segments the
|
||||
episode into consecutive atomic subtasks. The resulting spans are then
|
||||
deterministically stitched into a contiguous full-episode cover so every
|
||||
frame has exactly one active subtask. See
|
||||
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
|
||||
for the production flag set (single camera, embedded frames, windowed
|
||||
subtask generation).
|
||||
|
||||
The writer does **not** add a `tools` column to the parquet — the tool
|
||||
catalog lives at `meta/info.json["tools"]` instead (see
|
||||
@@ -44,9 +38,11 @@ user pre-declared.
|
||||
|
||||
If you want to declare additional tools for a dataset before annotation
|
||||
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
|
||||
anything already there. Implementations of those tools live under
|
||||
`src/lerobot/tools/`; one file per tool, registered via
|
||||
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
|
||||
anything already there. That makes the tool visible to the chat template
|
||||
so the model can learn to _generate_ the call. The runtime layer that
|
||||
_executes_ a generated call (the `Tool` protocol / `TOOL_REGISTRY` under
|
||||
`src/lerobot/tools/`) is not part of this PR — see the
|
||||
[Tools](./tools) doc, which marks those pieces as not-yet-implemented.
|
||||
|
||||
## Running on Hugging Face Jobs
|
||||
|
||||
@@ -59,19 +55,33 @@ HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
|
||||
```
|
||||
|
||||
[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
|
||||
spawns a multi-GPU `h200` job that:
|
||||
spawns a single-GPU `h200` job (scale up to `h200x4` for larger datasets) that:
|
||||
|
||||
1. installs the branch under test plus the annotation extras,
|
||||
2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
|
||||
chosen model, which the pipeline drives over the OpenAI-compatible API,
|
||||
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
|
||||
via `lerobot-annotate`,
|
||||
4. uploads the annotated dataset to `--push_to_hub`.
|
||||
4. with `--push_to_hub=true`, uploads the annotated dataset to
|
||||
`--new_repo_id` (or back to `--repo_id` in place when that is unset).
|
||||
|
||||
To target a different dataset, model, or hub repo, edit the `CMD` block
|
||||
inside the script — every flag in there maps directly onto a CLI flag of
|
||||
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
|
||||
|
||||
## Contributing new modules
|
||||
|
||||
The pipeline is built to be extended, and **contributions are very
|
||||
welcome** — whether that's a brand-new annotation module (e.g. a
|
||||
trajectory-trace or affordance module), a new prompt template, a better
|
||||
grounding flow, or quality improvements to the existing `plan` /
|
||||
`interjections` / `vqa` modules. Each module lives under
|
||||
`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
|
||||
client and keyframe cache, writes its raw output to the per-episode
|
||||
staging tree, and is wired into the executor as an independent phase.
|
||||
If you have an idea for a module or an improvement, open an issue or PR
|
||||
on [the repo](https://github.com/huggingface/lerobot).
|
||||
|
||||
## Style-to-recipe consumer mapping
|
||||
|
||||
The pipeline's outputs are designed to be consumed by recipes (see
|
||||
|
||||
@@ -1,21 +1,37 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6-27B VLM).
|
||||
|
||||
Spawns one ``h200x4`` job that:
|
||||
Spawns one single-GPU ``h200`` job that:
|
||||
|
||||
1. installs this branch of ``lerobot`` plus the annotation extras,
|
||||
2. boots four vllm servers (one per GPU) with Qwen3.6-27B (dense VLM),
|
||||
1. installs ``lerobot`` plus the annotation extras,
|
||||
2. boots one vllm server with Qwen3.6-27B (dense VLM),
|
||||
3. runs the plan / interjections / vqa modules across the dataset
|
||||
in free-form mode (each episode generates its own subtasks +
|
||||
memory),
|
||||
4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
|
||||
4. uploads the annotated dataset to ``--new_repo_id`` (when set)
|
||||
or back to ``--repo_id``.
|
||||
|
||||
Usage:
|
||||
|
||||
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
|
||||
|
||||
Adjust ``CMD`` below to point at your own dataset / target hub repo.
|
||||
Adjust ``CMD`` (dataset, model, hub repo) and ``flavor`` below for your
|
||||
run. For larger datasets, scale to ``h200x4`` and raise
|
||||
``--vlm.parallel_servers`` / ``--vlm.num_gpus`` to match.
|
||||
"""
|
||||
|
||||
import os
|
||||
@@ -29,7 +45,7 @@ if not token:
|
||||
CMD = (
|
||||
"apt-get update -qq && apt-get install -y -qq git ffmpeg && "
|
||||
"pip install --no-deps "
|
||||
"'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
|
||||
"'lerobot @ git+https://github.com/huggingface/lerobot.git@main' && "
|
||||
"pip install --upgrade-strategy only-if-needed "
|
||||
"datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
|
||||
"openai && "
|
||||
@@ -37,12 +53,12 @@ CMD = (
|
||||
"export VLLM_VIDEO_BACKEND=pyav && "
|
||||
"lerobot-annotate "
|
||||
"--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
|
||||
"--dest_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
|
||||
"--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
|
||||
"--push_to_hub=true "
|
||||
"--vlm.backend=openai "
|
||||
"--vlm.model_id=Qwen/Qwen3.6-27B "
|
||||
"--vlm.parallel_servers=4 "
|
||||
"--vlm.num_gpus=4 "
|
||||
"--vlm.parallel_servers=1 "
|
||||
"--vlm.num_gpus=1 "
|
||||
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
|
||||
"--tensor-parallel-size 1 --max-model-len 32768 "
|
||||
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
|
||||
@@ -111,7 +127,7 @@ CMD = (
|
||||
job = run_job(
|
||||
image="vllm/vllm-openai:latest",
|
||||
command=["bash", "-c", CMD],
|
||||
flavor="h200x4",
|
||||
flavor="h200",
|
||||
secrets={"HF_TOKEN": token},
|
||||
timeout="2h",
|
||||
)
|
||||
|
||||
+1
-1
@@ -417,7 +417,7 @@ exclude_dirs = [
|
||||
"benchmarks",
|
||||
"src/lerobot/datasets/push_dataset_to_hub",
|
||||
]
|
||||
skips = ["B101", "B311", "B404", "B603", "B607", "B615"]
|
||||
skips = ["B101", "B311", "B404", "B603", "B615"]
|
||||
|
||||
[tool.typos]
|
||||
default.extend-ignore-re = [
|
||||
|
||||
@@ -370,13 +370,14 @@ class AnnotationPipelineConfig:
|
||||
|
||||
# Hub dataset id. Used as the download source when ``root`` is unset,
|
||||
# and as the destination repo when ``push_to_hub`` is enabled and
|
||||
# ``dest_repo_id`` is unset.
|
||||
# ``new_repo_id`` is unset.
|
||||
repo_id: str | None = None
|
||||
|
||||
# Optional separate Hub dataset id to push the annotated result to. When
|
||||
# unset, ``push_to_hub`` uploads back to ``repo_id`` (annotate in place);
|
||||
# when set, the source ``repo_id`` is left untouched.
|
||||
dest_repo_id: str | None = None
|
||||
# Optional separate Hub dataset id to push the annotated result to (named
|
||||
# ``new_repo_id`` to match the LeRobot dataset edit tools). When unset,
|
||||
# ``push_to_hub`` uploads back to ``repo_id`` (annotate in place); when
|
||||
# set, the source ``repo_id`` is left untouched.
|
||||
new_repo_id: str | None = None
|
||||
|
||||
root: Path | None = None
|
||||
|
||||
@@ -404,7 +405,7 @@ class AnnotationPipelineConfig:
|
||||
video_backend: str | None = None
|
||||
|
||||
# When True, upload the annotated dataset to the Hugging Face Hub:
|
||||
# to ``dest_repo_id`` if set, otherwise back to ``repo_id``. One of
|
||||
# to ``new_repo_id`` if set, otherwise back to ``repo_id``. One of
|
||||
# the two must be set for this to take effect.
|
||||
push_to_hub: bool = False
|
||||
push_private: bool = False
|
||||
|
||||
@@ -15,14 +15,8 @@
|
||||
# limitations under the License.
|
||||
"""In-process executor that runs the annotation phases.
|
||||
|
||||
The executor plans **seven phases** in the dependency order from the plan:
|
||||
The executor runs **six phases** in dependency order:
|
||||
|
||||
phase 0: vocabulary discovery — derive a small canonical vocabulary
|
||||
from the first few sample-episode videos (subtask labels +
|
||||
memory milestones) and persist it next to the dataset; the
|
||||
``plan`` module then constrains every per-episode generation
|
||||
to those strings, so the downstream policy sees a small,
|
||||
repeatable conditioning distribution
|
||||
phase 1: ``plan`` module (plan + subtasks + memory)
|
||||
phase 2: ``interjections`` module (interjections + speech)
|
||||
phase 3: ``plan`` plan-update pass — re-runs plan emission at every
|
||||
|
||||
@@ -146,6 +146,7 @@ class VideoFrameProvider:
|
||||
# ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
|
||||
# one-shot warn flag against concurrent updates from worker threads.
|
||||
_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
|
||||
_warned_decode_fail: bool = field(default=False, init=False, repr=False)
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata # noqa: PLC0415
|
||||
@@ -285,7 +286,9 @@ class VideoFrameProvider:
|
||||
str(out_path),
|
||||
]
|
||||
try:
|
||||
subprocess.run(cmd, check=True, timeout=300)
|
||||
# ffmpeg is invoked by name via PATH lookup (the standard way to
|
||||
# call the CLI); the arg list is fully controlled here, not shell.
|
||||
subprocess.run(cmd, check=True, timeout=300) # nosec B607
|
||||
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
|
||||
return None
|
||||
return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
|
||||
@@ -335,7 +338,7 @@ class VideoFrameProvider:
|
||||
# []) is debuggable from the job log instead of post-hoc parquet
|
||||
# inspection. Subsequent failures stay quiet.
|
||||
with self._lock:
|
||||
already_warned = getattr(self, "_warned_decode_fail", False)
|
||||
already_warned = self._warned_decode_fail
|
||||
if not already_warned:
|
||||
self._warned_decode_fail = True
|
||||
if not already_warned:
|
||||
@@ -382,7 +385,8 @@ def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any
|
||||
|
||||
frames: list[Any] = []
|
||||
for ts in timestamps:
|
||||
proc = subprocess.run(
|
||||
# ffmpeg invoked by name via PATH lookup; fully-controlled arg list, no shell.
|
||||
proc = subprocess.run( # nosec B607
|
||||
[
|
||||
"ffmpeg",
|
||||
"-nostdin",
|
||||
|
||||
@@ -95,6 +95,7 @@ class GeneralVqaModule:
|
||||
config: VqaConfig
|
||||
seed: int = 1729
|
||||
frame_provider: FrameProvider = field(default_factory=null_provider)
|
||||
_warned_no_camera: bool = field(default=False, init=False, repr=False)
|
||||
|
||||
@property
|
||||
def enabled(self) -> bool:
|
||||
@@ -113,7 +114,7 @@ class GeneralVqaModule:
|
||||
# No camera available — emit nothing rather than producing
|
||||
# untagged rows that would fail validation. Surface a loud one-
|
||||
# time warning so this is never silently a no-op.
|
||||
if not getattr(self, "_warned_no_camera", False):
|
||||
if not self._warned_no_camera:
|
||||
logging.getLogger(__name__).warning(
|
||||
"vqa module found no cameras on the frame provider — "
|
||||
"every episode will emit zero VQA rows. Check that the "
|
||||
@@ -191,8 +192,17 @@ class GeneralVqaModule:
|
||||
default = getattr(self.frame_provider, "camera_key", None)
|
||||
if default and default in all_cameras:
|
||||
return [default]
|
||||
# ``restrict_to_default_camera`` is set but the configured default
|
||||
# isn't one the provider exposes. Returning it anyway would make
|
||||
# ``_decode`` raise a KeyError deep in frame extraction, so warn and
|
||||
# fall through to every available camera instead.
|
||||
if default:
|
||||
return [default]
|
||||
logging.getLogger(__name__).warning(
|
||||
"restrict_to_default_camera is set but camera_key=%r is not in the "
|
||||
"provider's cameras %s; grounding VQA on all available cameras instead.",
|
||||
default,
|
||||
all_cameras,
|
||||
)
|
||||
return all_cameras
|
||||
|
||||
def _build_messages(
|
||||
@@ -202,7 +212,7 @@ class GeneralVqaModule:
|
||||
frame_timestamp: float,
|
||||
camera_key: str,
|
||||
) -> list[dict[str, Any]]:
|
||||
prompt = load_prompt("module_3_vqa").format(
|
||||
prompt = load_prompt("vqa").format(
|
||||
episode_task=record.episode_task,
|
||||
question_type=question_type,
|
||||
)
|
||||
|
||||
@@ -85,7 +85,7 @@ class InterjectionsAndSpeechModule:
|
||||
return current
|
||||
|
||||
def _initial_speech(self, record: EpisodeRecord) -> str | None:
|
||||
prompt = load_prompt("module_2_initial_speech").format(
|
||||
prompt = load_prompt("interjections_initial_speech").format(
|
||||
episode_task=record.episode_task,
|
||||
)
|
||||
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
|
||||
@@ -147,7 +147,7 @@ class InterjectionsAndSpeechModule:
|
||||
# previous subtask and the start of the next one — same
|
||||
# conditioning the policy will see at training time.
|
||||
window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
|
||||
prompt = load_prompt("module_2_interjection").format(
|
||||
prompt = load_prompt("interjections_interjection").format(
|
||||
episode_task=record.episode_task,
|
||||
prev_subtask=prev_subtask or "(starting from initial state)",
|
||||
next_subtask=next_subtask,
|
||||
@@ -198,11 +198,12 @@ class InterjectionsAndSpeechModule:
|
||||
# Center the window on the anchor so half lands before, half after.
|
||||
start_offset = -window / 2.0
|
||||
targets = [t_anchor + start_offset + step * i for i in range(n)]
|
||||
first_ts = float(frame_timestamps[0])
|
||||
last_ts = float(frame_timestamps[-1])
|
||||
snapped: list[float] = []
|
||||
seen: set[float] = set()
|
||||
for tgt in targets:
|
||||
clamped = min(last_ts, max(0.0, tgt))
|
||||
clamped = min(last_ts, max(first_ts, tgt))
|
||||
t = snap_to_frame(clamped, frame_timestamps)
|
||||
if t not in seen:
|
||||
seen.add(t)
|
||||
|
||||
@@ -285,14 +285,14 @@ class PlanSubtasksMemoryModule:
|
||||
|
||||
def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
|
||||
"""Ask the VLM "what is this video about" with no task hint at all."""
|
||||
text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
|
||||
text = self._vlm_field(self._video_message(record, load_prompt("plan_video_task")), "task")
|
||||
return text.strip() if isinstance(text, str) and text.strip() else None
|
||||
|
||||
def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
|
||||
"""Generate ``n`` text-only paraphrases of ``base_task``."""
|
||||
if n <= 0 or not base_task:
|
||||
return []
|
||||
prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
|
||||
prompt = load_prompt("plan_task_rephrasings").format(base_task=base_task, n=n)
|
||||
raw = self._vlm_field(self._text_message(prompt), "rephrasings")
|
||||
if not isinstance(raw, list):
|
||||
return []
|
||||
@@ -343,7 +343,7 @@ class PlanSubtasksMemoryModule:
|
||||
)
|
||||
return None
|
||||
|
||||
prompt = load_prompt("module_1_action_record").format(
|
||||
prompt = load_prompt("plan_action_record").format(
|
||||
episode_task=episode_task,
|
||||
subtask_text=span.get("text", ""),
|
||||
start_time=start_t,
|
||||
@@ -416,7 +416,7 @@ class PlanSubtasksMemoryModule:
|
||||
"""
|
||||
if not base_task:
|
||||
return []
|
||||
prompt = load_prompt("module_1_task_aug_axes").format(
|
||||
prompt = load_prompt("plan_task_aug_axes").format(
|
||||
base_task=base_task,
|
||||
n_synonym=axes_cfg.synonym_paraphrase,
|
||||
n_omit_arm=axes_cfg.omit_arm,
|
||||
@@ -596,7 +596,7 @@ class PlanSubtasksMemoryModule:
|
||||
)
|
||||
|
||||
# ---- Pass 2: segmentation ------------------------------------
|
||||
prompt = load_prompt("module_1_subtasks").format(
|
||||
prompt = load_prompt("plan_subtasks").format(
|
||||
episode_task=effective_task,
|
||||
min_subtask_seconds=self.config.min_subtask_seconds,
|
||||
max_steps=self.config.plan_max_steps,
|
||||
@@ -679,7 +679,7 @@ class PlanSubtasksMemoryModule:
|
||||
"action that is not in your description above.\n\n"
|
||||
)
|
||||
|
||||
prompt = load_prompt("module_1_subtasks").format(
|
||||
prompt = load_prompt("plan_subtasks").format(
|
||||
episode_task=task,
|
||||
min_subtask_seconds=self.config.min_subtask_seconds,
|
||||
max_steps=self.config.plan_max_steps,
|
||||
@@ -778,7 +778,7 @@ class PlanSubtasksMemoryModule:
|
||||
self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
|
||||
) -> str:
|
||||
"""Grounding pass: free-form chronological description of the (windowed) video."""
|
||||
prompt = load_prompt("module_1_subtask_describe").format(episode_task=task)
|
||||
prompt = load_prompt("plan_subtask_describe").format(episode_task=task)
|
||||
text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
|
||||
return text.strip() if isinstance(text, str) and text.strip() else ""
|
||||
|
||||
@@ -882,7 +882,7 @@ class PlanSubtasksMemoryModule:
|
||||
*,
|
||||
task: str | None = None,
|
||||
) -> str:
|
||||
prompt = load_prompt("module_1_memory").format(
|
||||
prompt = load_prompt("plan_memory").format(
|
||||
episode_task=(task if task is not None else record.episode_task),
|
||||
prior_memory=prior_memory or "(none)",
|
||||
completed_subtask=completed,
|
||||
|
||||
@@ -138,7 +138,13 @@ class StagingValidator:
|
||||
for row in all_rows:
|
||||
self._check_column_routing(row, report, record.episode_index)
|
||||
self._check_camera_field(row, report, record.episode_index, self.dataset_camera_keys)
|
||||
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
|
||||
# ``_check_column_routing`` already recorded any unknown-style error;
|
||||
# don't let the same ``column_for_style`` lookup raise here uncaught.
|
||||
try:
|
||||
column = column_for_style(row.get("style"))
|
||||
except ValueError:
|
||||
continue
|
||||
if column == LANGUAGE_PERSISTENT:
|
||||
persistent.append(row)
|
||||
else:
|
||||
events.append(row)
|
||||
|
||||
@@ -598,13 +598,3 @@ def _pil_to_data_url(image: Any) -> str:
|
||||
image.save(buf, format="PNG")
|
||||
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
|
||||
return f"data:image/png;base64,{b64}"
|
||||
|
||||
|
||||
def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
|
||||
"""Pass-through hook used by the vllm backend.
|
||||
|
||||
vllm exposes its own multimodal entry points that vary by version; for the
|
||||
base flow we simply forward the raw message list and let the caller's
|
||||
custom backend handle templating. Real deployments override this.
|
||||
"""
|
||||
return list(messages)
|
||||
|
||||
@@ -113,9 +113,9 @@ def annotate(cfg: AnnotationPipelineConfig) -> None:
|
||||
logger.warning(w)
|
||||
|
||||
if cfg.push_to_hub:
|
||||
if cfg.repo_id is None and cfg.dest_repo_id is None:
|
||||
if cfg.repo_id is None and cfg.new_repo_id is None:
|
||||
raise ValueError(
|
||||
"--push_to_hub requires --repo_id or --dest_repo_id (the dataset repo to push to)."
|
||||
"--push_to_hub requires --repo_id or --new_repo_id (the dataset repo to push to)."
|
||||
)
|
||||
_push_to_hub(root, cfg)
|
||||
|
||||
@@ -123,11 +123,11 @@ def annotate(cfg: AnnotationPipelineConfig) -> None:
|
||||
def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
|
||||
"""Upload the annotated dataset directory to the Hub.
|
||||
|
||||
Pushes to ``cfg.dest_repo_id`` when set, otherwise back to ``cfg.repo_id``.
|
||||
Pushes to ``cfg.new_repo_id`` when set, otherwise back to ``cfg.repo_id``.
|
||||
"""
|
||||
from huggingface_hub import HfApi # noqa: PLC0415
|
||||
|
||||
repo_id = cfg.dest_repo_id or cfg.repo_id
|
||||
repo_id = cfg.new_repo_id or cfg.repo_id
|
||||
commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
|
||||
api = HfApi()
|
||||
print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
|
||||
|
||||
@@ -60,13 +60,11 @@ def _stub_responder(messages):
|
||||
{"text": "place the bottle down", "start": 2.0, "end": 3.0},
|
||||
]
|
||||
}
|
||||
if "concise hierarchical PLAN" in text:
|
||||
return {"plan": "1. grasp\n2. pour\n3. place"}
|
||||
if "Update the memory" in text:
|
||||
if "compressed semantic memory" in text:
|
||||
return {"memory": "poured once"}
|
||||
if "acknowledgement the robot" in text:
|
||||
return {"text": "Sure."}
|
||||
if "ONE realistic interruption" in text:
|
||||
if "compact interjection" in text:
|
||||
return {"interjection": "use less water", "speech": "Using less water."}
|
||||
if "frame-grounded visual question" in text:
|
||||
return {"question": "How many cups?", "answer": {"label": "cup", "count": 1}}
|
||||
@@ -94,6 +92,23 @@ def main() -> int:
|
||||
print(f"phases={[(p.name, p.episodes_processed) for p in summary.phases]}")
|
||||
print(f"validation: {summary.validation_report.summary()}")
|
||||
print(f"shards rewritten: {len(summary.written_paths)}")
|
||||
|
||||
# Assert the interjection code path actually fired — otherwise a stale
|
||||
# canned-VLM marker would silently produce zero interjections and this
|
||||
# smoke run would still "pass" by only printing.
|
||||
import pyarrow.parquet as pq # noqa: PLC0415
|
||||
|
||||
events = [
|
||||
r
|
||||
for shard in summary.written_paths
|
||||
for ev in pq.read_table(shard).column("language_events").to_pylist()
|
||||
for r in ev
|
||||
]
|
||||
n_interjections = sum(1 for r in events if r.get("style") == "interjection")
|
||||
n_speech = sum(1 for r in events if r.get("style") is None and r.get("role") == "assistant")
|
||||
print(f"interjections={n_interjections} speech_atoms={n_speech}")
|
||||
assert n_interjections > 0, "no interjection rows produced — check the interjection prompt marker"
|
||||
assert n_speech > 0, "no speech tool-call atoms produced — check the speech prompt marker"
|
||||
return 0
|
||||
|
||||
|
||||
|
||||
@@ -151,7 +151,7 @@ def test_module2_mid_episode_emits_paired_interjection_and_speech(
|
||||
{
|
||||
"acknowledgement the robot": {"text": "OK."},
|
||||
# Marker matches the distinctive line of
|
||||
# ``module_2_interjection.txt`` ("Write ONE compact
|
||||
# ``interjections_interjection.txt`` ("Write ONE compact
|
||||
# interjection ..."). Keep this in sync with that prompt's
|
||||
# wording — the canned responder matches on substring.
|
||||
"Write ONE compact interjection": {
|
||||
@@ -245,7 +245,6 @@ def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Pa
|
||||
{"text": "wipe the counter", "start": 0.5, "end": 1.1},
|
||||
]
|
||||
}
|
||||
plan_payload = {"plan": "1. grasp\n2. wipe"}
|
||||
memory_payload = {"memory": "wiped once"}
|
||||
|
||||
def responder(messages):
|
||||
@@ -255,9 +254,7 @@ def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Pa
|
||||
for block in m.get("content", []):
|
||||
if isinstance(block, dict) and block.get("type") == "text":
|
||||
text = block.get("text", "")
|
||||
if "concise hierarchical PLAN" in text:
|
||||
return plan_payload
|
||||
if "Update the memory" in text:
|
||||
if "compressed semantic memory" in text:
|
||||
return memory_payload
|
||||
return payload
|
||||
|
||||
|
||||
@@ -13,7 +13,7 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""End-to-end smoke: pipeline output → PR 1 canonical recipe rendering."""
|
||||
"""End-to-end smoke: pipeline output → canonical recipe rendering."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
@@ -49,14 +49,15 @@ from lerobot.datasets.language_render import render_sample # noqa: E402
|
||||
from ._helpers import make_canned_responder # noqa: E402
|
||||
|
||||
|
||||
def _build_pr1_style_blend_recipe() -> TrainingRecipe:
|
||||
def _build_style_blend_recipe() -> TrainingRecipe:
|
||||
"""Inline blend recipe that consumes every style this pipeline produces.
|
||||
|
||||
PR 1 used to ship ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as
|
||||
a canonical example, but that file was dropped during PR 1 review. The
|
||||
cross-PR contract this test guards is "the recipe DSL can render
|
||||
non-empty messages from pipeline output", which doesn't require a
|
||||
specific YAML — so we build the equivalent blend in code.
|
||||
The language schema/DSL work used to ship
|
||||
``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as a canonical
|
||||
example, but that file was dropped during review. The contract this
|
||||
test guards is "the recipe DSL can render non-empty messages from
|
||||
pipeline output", which doesn't require a specific YAML — so we build
|
||||
the equivalent blend in code.
|
||||
"""
|
||||
return TrainingRecipe(
|
||||
blend={
|
||||
@@ -109,10 +110,9 @@ def _build_executor() -> Executor:
|
||||
{"text": "place the bottle down", "start": 1.0, "end": 1.5},
|
||||
]
|
||||
},
|
||||
"concise hierarchical PLAN": {"plan": "1. grasp\n2. pour\n3. place"},
|
||||
"Update the memory": {"memory": "poured once"},
|
||||
"compressed semantic memory": {"memory": "poured once"},
|
||||
"acknowledgement the robot": {"text": "Sure."},
|
||||
"ONE realistic interruption": {
|
||||
"compact interjection": {
|
||||
"interjection": "use less water",
|
||||
"speech": "Using less water.",
|
||||
},
|
||||
@@ -137,7 +137,7 @@ def _build_executor() -> Executor:
|
||||
)
|
||||
|
||||
|
||||
def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
|
||||
def test_canonical_recipe_renders_nonempty_from_pipeline_output(
|
||||
single_episode_root: Path,
|
||||
) -> None:
|
||||
executor = _build_executor()
|
||||
@@ -150,7 +150,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
|
||||
events_lists = table.column("language_events").to_pylist()
|
||||
timestamps = table.column("timestamp").to_pylist()
|
||||
|
||||
recipe = _build_pr1_style_blend_recipe()
|
||||
recipe = _build_style_blend_recipe()
|
||||
|
||||
rendered_any = False
|
||||
for ts, persistent, events in zip(timestamps, persistent_lists, events_lists, strict=True):
|
||||
@@ -168,7 +168,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
|
||||
rendered_any = True
|
||||
assert result["target_message_indices"]
|
||||
break
|
||||
assert rendered_any, "PR 1 recipe rendered no messages from pipeline output"
|
||||
assert rendered_any, "recipe rendered no messages from pipeline output"
|
||||
|
||||
# Sanity: speech atom appears in events column intact
|
||||
flat_events = [r for ev in events_lists for r in ev]
|
||||
@@ -177,7 +177,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
|
||||
say = speech_rows[0]["tool_calls"][0]
|
||||
assert say["function"]["name"] == "say"
|
||||
assert isinstance(say["function"]["arguments"]["text"], str)
|
||||
# PR 2 no longer writes a ``tools`` column — the say schema lives as a
|
||||
# constant (``SAY_TOOL_SCHEMA``) so PR 1's row struct is the single
|
||||
# source of truth for the v3.1 schema.
|
||||
# The pipeline does not write a ``tools`` column — the say schema lives
|
||||
# as a constant (``SAY_TOOL_SCHEMA``) so the language row struct is the
|
||||
# single source of truth for the v3.1 schema.
|
||||
assert "tools" not in table.column_names
|
||||
|
||||
@@ -229,7 +229,7 @@ def test_writer_drops_subtask_index_idempotent(fixture_dataset_root: Path, tmp_p
|
||||
assert "language_events" in table_a.column_names
|
||||
# The writer no longer emits a dataset-level ``tools`` column; the
|
||||
# ``say`` tool schema lives as a code constant (``SAY_TOOL_SCHEMA``)
|
||||
# so the parquet stays small and PR 2 doesn't extend PR 1's schema.
|
||||
# so the parquet stays small and the pipeline doesn't extend the schema.
|
||||
assert "tools" not in table_a.column_names
|
||||
|
||||
# second pass — must produce identical bytes for the language columns
|
||||
|
||||
@@ -35,7 +35,7 @@ def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):
|
||||
|
||||
cfg = SimpleNamespace(
|
||||
repo_id="source/dataset",
|
||||
dest_repo_id="annotated/dataset",
|
||||
new_repo_id="annotated/dataset",
|
||||
push_private=True,
|
||||
push_commit_message=None,
|
||||
)
|
||||
|
||||
Reference in New Issue
Block a user