feat: language annotation pipeline (PR 2/3)

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-16 09:09:48 +00:00

Adds the steerable annotation pipeline (`lerobot-annotate`), which writes
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.
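
A minimal sketch of what "one-retry JSON validation" in Module 3 could look
like. This is an illustration, not the code in this PR: the `question` /
`answer` schema, the helper names, and the retry hint text are all
hypothetical, and the backend is stubbed with a plain callable.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: the real VQA pairs may carry different keys.
REQUIRED_KEYS = {"question", "answer"}


def parse_vqa(raw: str) -> Optional[dict]:
    """Return the parsed pair if it is a JSON object with the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    return obj


def generate_with_one_retry(
    generate: Callable[[str], str], prompt: str
) -> Optional[dict]:
    """Call the backend, validate the reply, and retry exactly once on failure."""
    for _attempt in range(2):
        result = parse_vqa(generate(prompt))
        if result is not None:
            return result
        # Hypothetical retry hint appended before the second attempt.
        prompt = prompt + "\nReturn only valid JSON with keys: question, answer."
    return None


# Stub backend: the first reply is invalid, the second is valid.
replies = iter(["not json", '{"question": "How many cups?", "answer": "2"}'])
pair = generate_with_one_retry(lambda _prompt: next(replies), "Describe the scene.")
```

With the stub above, the first reply fails validation and the second is
accepted; a backend that fails twice yields `None`, so the caller can skip
that frame.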

The writer enforces per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, and a dataset-level
`tools` column with the `say` schema; it also drops the legacy
`subtask_index`. The validator runs against staged JSONL artifacts
before the writer rewrites parquet.
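
As an illustration of the "exact-frame event timestamps" rule, a check of
this kind could flag events that do not land on a frame. This is only a
sketch under assumptions: the `timestamp` and `style` field names, the style
values, and the function name are hypothetical and not taken from this PR.

```python
from typing import Iterable


def off_frame_events(
    frame_timestamps: Iterable[float], events: Iterable[dict]
) -> list[dict]:
    """Return the events whose timestamp is not exactly a frame timestamp."""
    frames = set(frame_timestamps)
    return [event for event in events if event["timestamp"] not in frames]


# Five frames at 10 fps; the second event falls between two frames.
frames = [i / 10 for i in range(5)]
events = [
    {"timestamp": 0.2, "style": "interjection"},  # hypothetical style name
    {"timestamp": 0.25, "style": "vqa"},          # hypothetical style name
]
bad = off_frame_events(frames, events)
```

Exact equality is deliberate here: the rule as stated is that events sit on
a frame, not near one, so a tolerance would hide the very errors the check
is meant to catch.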

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pepijn

2026-04-27 16:22:51 +02:00

parent 1ca38d9748

commit 785cee429e

33 changed files with 3409 additions and 0 deletions

Makefile (+6)
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
 		--env.episode_length=5 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1
+
+# E2E annotation pipeline smoke test against a tiny in-memory fixture
+# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
+# backend, so it does not require a real model checkpoint or GPU.
+annotation-e2e:
+	uv run python -m tests.annotations.run_e2e_smoke