lerobot/docs/source/annotation_pipeline.mdx

# Annotation Pipeline

`lerobot-annotate` writes the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page,
`language_persistent` and `language_events`, directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree, so keeping multiple revisions of a dataset means keeping multiple
dataset copies.

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom                                | Column                | Module   |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection)  | `language_persistent` | Module 1 |
| `memory` (MEM-style compression)            | `language_persistent` | Module 1 |
| `interjection`                              | `language_events`     | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | Module 2 |
| `vqa` (user / assistant pair)               | `language_events`     | Module 3 |

The writer also adds a dataset-level `tools` column carrying the JSON schema
for the `say` tool call, and drops the legacy `subtask_index` column.
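As a rough illustration of what a `say` tool schema and a matching speech tool call could look like, here is a minimal sketch. It assumes the common OpenAI-style function-calling layout; the `description` text, the `parameters` layout, and the helper name `make_say_call` are illustrative assumptions, not the writer's confirmed output.

```python
import json

# Hypothetical schema for the `say` tool in an OpenAI-style layout.
# Field names beyond `name: "say"` and the `text` argument are assumptions.
SAY_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance to the user.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def make_say_call(text: str) -> dict:
    """Build a tool call shaped like the speech atom described on this page."""
    return {
        "type": "function",
        "function": {"name": "say", "arguments": {"text": text}},
    }


# A dataset-level `tools` value could then be the serialized schema list.
tools_column_value = json.dumps([SAY_TOOL_SCHEMA])
```

Inspect the `tools` column of an annotated dataset to see the schema the writer actually emits.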

## How to run it locally or on SLURM

Install the extra and invoke the console script:

```bash
uv sync --extra annotations
uv run lerobot-annotate \
  --repo_id=imstevenpmwork/super_poulain_draft \
  --vlm.backend=vllm \
  --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
  --vlm.tensor_parallel_size=2
```

The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.

**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
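The frame cap can be pictured with the sketch below. It assumes uniform subsampling when an episode has more frames than `max_video_frames`; the actual selection strategy may differ, and the helper names `sample_frame_indices` and `build_video_block` are hypothetical.

```python
def sample_frame_indices(num_frames: int, max_video_frames: int = 32) -> list[int]:
    """Pick at most `max_video_frames` indices spread evenly over the episode.

    Uniform spacing is an assumption made for illustration only.
    """
    if num_frames <= 0 or max_video_frames <= 0:
        return []
    if num_frames <= max_video_frames:
        return list(range(num_frames))
    if max_video_frames == 1:
        return [0]
    # Spread the picks so the first and last frames are always included.
    step = (num_frames - 1) / (max_video_frames - 1)
    return [round(i * step) for i in range(max_video_frames)]


def build_video_block(frames: list, max_video_frames: int = 32) -> dict:
    """Wrap the (possibly subsampled) frames in a single video content block."""
    indices = sample_frame_indices(len(frames), max_video_frames)
    return {"type": "video", "video": [frames[i] for i in indices]}


# Example: a 900-frame episode is reduced to 32 frames in one block.
block = build_video_block(list(range(900)))
```

Here the cap only bounds how many frames reach the model; where to cut subtasks is still left to the VLM.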

The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
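The selection rule can be summarised in a few lines. This is a sketch only: the function name `pick_executor` is hypothetical, and whether the threshold comparison is inclusive is an assumption.

```python
def pick_executor(
    num_episodes: int,
    auto_threshold: int = 32,
    force_local: bool = False,
) -> str:
    """Return the name of the executor class that would handle the run.

    Assumption: datasets at or below the threshold run locally.
    """
    if force_local or num_episodes <= auto_threshold:
        return "LocalPipelineExecutor"
    return "SlurmPipelineExecutor"
```

With the default threshold, a 20-episode dataset would run locally and a 500-episode dataset would go to SLURM unless `force_local` is set.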

## Style-to-recipe consumer mapping

The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/pi05_hirobot.yaml`:

- `low_level_execution`, `high_level_subtask`, `memory_update` consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
  paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
  `language_events`.

## Why the design is scoped to the canonical recipe

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
   `plan`, `memory`) are broadcast across each episode and answer "what
   state is in force at this frame?". Event rows (`interjection`, `vqa`,
   speech) appear only on the exact frame whose timestamp matches the
   emission. The pipeline writes timestamps taken straight from the source
   parquet, with no floating-point recomputation, so exact-match lookups
   stay reliable.
2. **One Qwen-VL pass.** All three modules share a single VLM client
   (vLLM if available, transformers fallback) so the cost is one model
   load per dataset, not three.
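The difference between the two lookups, and the reason timestamps are copied rather than recomputed, can be shown with a small sketch. The row layouts and the helper names `state_in_force` and `events_at` are hypothetical and do not reflect the real parquet schema.

```python
from bisect import bisect_right

# Hypothetical frame timestamps as they might appear in the source parquet.
frame_timestamps = [0.0, 0.1, 0.2, 0.3, 0.4]

# Persistent rows: (timestamp, style, text), sorted by timestamp.
persistent_rows = [
    (0.0, "subtask", "reach for the bowl"),
    (0.3, "subtask", "lift the bowl"),
]

# Event rows are keyed by a timestamp copied verbatim from a frame.
event_rows = {frame_timestamps[3]: [("interjection", "use the other hand")]}


def state_in_force(rows, style, t):
    """Return the latest row of `style` whose timestamp is <= t, or None."""
    matching = [r for r in rows if r[1] == style]
    times = [r[0] for r in matching]
    i = bisect_right(times, t)
    return matching[i - 1] if i else None


def events_at(rows_by_timestamp, t):
    """Return the events emitted on exactly this frame timestamp."""
    return rows_by_timestamp.get(t, [])


copied = frame_timestamps[3]  # taken straight from the source
recomputed = 3 * 0.1          # differs from 0.3 in IEEE-754 doubles
```

`events_at(event_rows, copied)` finds the interjection, while `events_at(event_rows, recomputed)` returns an empty list because the recomputed value is not bit-identical to the stored one.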

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
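The staging layout can be reproduced with a short helper. This is a sketch: the function names and the module label `module_1` are hypothetical, and only the path pattern comes from the text above.

```python
import json
from pathlib import Path


def staging_path(root: str, episode_index: int, module: str) -> Path:
    """Build <root>/.annotate_staging/episode_{N:06d}/<module>.jsonl."""
    return (
        Path(root)
        / ".annotate_staging"
        / f"episode_{episode_index:06d}"
        / f"{module}.jsonl"
    )


def write_stage(root: str, episode_index: int, module: str, rows: list[dict]) -> Path:
    """Overwrite one module's JSONL file without touching the other modules."""
    path = staging_path(root, episode_index, module)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates the file, so a rerun replaces only this module's output.
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path
```

For example, `staging_path("/data/ds", 7, "module_1")` resolves to `/data/ds/.annotate_staging/episode_000007/module_1.jsonl`.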

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphaned speech or interjection rows (each one has its pair);
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer; pass `--skip_validation=true` to override this
while debugging.
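Three of the checks listed above (timestamp alignment, plan refresh, and column routing) can be sketched as follows. This is not the `StagingValidator` implementation: the row keys `style`, `timestamp`, and `column` are hypothetical, and `column_for_style_sketch` is a stand-in that mirrors the table in "What the pipeline produces".

```python
# Stand-in for the real `column_for_style`; the mapping is an assumption
# derived from the style/column table on this page.
PERSISTENT_STYLES = {"subtask", "plan", "memory"}


def column_for_style_sketch(style):
    """Route persistent styles to one column and everything else to events."""
    if style in PERSISTENT_STYLES:
        return "language_persistent"
    return "language_events"  # includes the speech atom, whose style is None


def check_staging(frame_timestamps, rows):
    """Return a list of error strings; an empty list means the checks passed."""
    errors = []
    frame_ts = set(frame_timestamps)
    plan_ts = {r["timestamp"] for r in rows if r["style"] == "plan"}
    for r in rows:
        expected = column_for_style_sketch(r["style"])
        if r["column"] != expected:
            errors.append(f"{r['style']} routed to {r['column']}, expected {expected}")
        # Event rows must sit on an exact frame timestamp.
        if expected == "language_events" and r["timestamp"] not in frame_ts:
            errors.append(f"{r['style']} at {r['timestamp']} matches no frame")
        # Every interjection needs a plan refresh at the same timestamp.
        if r["style"] == "interjection" and r["timestamp"] not in plan_ts:
            errors.append(f"no plan refresh at interjection {r['timestamp']}")
    return errors


good_rows = [
    {"style": "plan", "timestamp": 0.0, "column": "language_persistent"},
    {"style": "plan", "timestamp": 0.2, "column": "language_persistent"},
    {"style": "interjection", "timestamp": 0.2, "column": "language_events"},
    {"style": None, "timestamp": 0.2, "column": "language_events"},
]
bad_rows = [{"style": "interjection", "timestamp": 0.25, "column": "language_persistent"}]
```

Running `check_staging([0.0, 0.1, 0.2], good_rows)` returns no errors, while `bad_rows` trips all three checks.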

## Paper inspirations per module

- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom:
  `tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`.
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
  keypoints) and Steerable Policies' multi-abstraction grounding.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` takes at most tens of
KB per episode (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
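A minimal sketch of the two ingredients, per-episode RNGs derived from one seed and a content hash per prompt file, is shown below. The derivation scheme (SHA-256 over `seed:episode_index`) and the function names are assumptions made for illustration; the pipeline may derive its RNGs and hashes differently.

```python
import hashlib
import random
from pathlib import Path


def episode_rng(seed: int, episode_index: int) -> random.Random:
    """Derive an independent, reproducible RNG for one episode.

    Hashing `seed:episode_index` is an assumed scheme, not the pipeline's.
    """
    digest = hashlib.sha256(f"{seed}:{episode_index}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def hash_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw prompt bytes."""
    return hashlib.sha256(data).hexdigest()


def prompt_hash(path: str) -> str:
    """Hash a prompt template file so that edits change the recorded hash."""
    return hash_bytes(Path(path).read_bytes())


# Two RNGs built from the same (seed, episode) pair yield the same draws.
a = episode_rng(1729, 3)
b = episode_rng(1729, 3)
same_draws = [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```

Because each episode gets its own RNG, processing episodes in a different order or in parallel would not change the draws for any single episode under this scheme.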