feat(annotate): Module 1 sees the whole episode as one video block

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-20 11:09:59 +00:00

Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks, so there is no stride or keyframe-count knob to
tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.
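
A minimal sketch of the flow described above, not the actual lerobot
code. The function names ``uniform_frame_indices`` and
``build_subtask_content`` are hypothetical, and dropping the video block
when no frames are available is an assumption about how the no-video
fallback is handled. Only the message shape and the ``max_video_frames``
default of 32 come from this commit message.

```python
from typing import Any


def uniform_frame_indices(num_frames: int, max_frames: int = 32) -> list[int]:
    """Pick up to ``max_frames`` indices spread evenly over an episode.

    Hypothetical helper: the real ``VideoFrameProvider`` may sample by
    timestamp rather than by frame index.
    """
    if num_frames <= 0 or max_frames <= 0:
        return []
    n = min(num_frames, max_frames)
    if n == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic and always
    # includes the first and last frame of the episode.
    return [(i * (num_frames - 1)) // (n - 1) for i in range(n)]


def build_subtask_content(frames: list[Any], prompt: str) -> list[dict[str, Any]]:
    """Assemble the content list for the subtask prompt.

    With frames: one video block followed by one text block, matching
    the shape quoted in the commit message. Without frames: text only
    (assumed fallback behavior, not confirmed by the source).
    """
    content: list[dict[str, Any]] = []
    if frames:
        content.append({"type": "video", "video": list(frames)})
    content.append({"type": "text", "text": prompt})
    return content


if __name__ == "__main__":
    # 100-frame episode capped at 4 frames for illustration.
    indices = uniform_frame_indices(100, max_frames=4)
    frames = [f"frame_{i}" for i in indices]  # placeholders for decoded images
    print(build_subtask_content(frames, "Watch the whole clip, then segment it."))
```

The cap bounds what the model receives per episode; it does not decide
where subtasks are cut, which is left to the model.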

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pepijn

2026-04-27 17:08:36 +02:00

parent 4f4dd49972

commit 9d5aa1c63e

8 changed files with 143 additions and 27 deletions

									
tests/annotations/run_e2e_smoke.py (+1, -1)
@@ -79,7 +79,7 @@ def _stub_responder(messages):
                         text = block.get("text", "")
             elif isinstance(content, str):
                 text = content
-    if "Decompose the demonstration" in text:
+    if "atomic subtasks" in text:
         return {
             "subtasks": [
                 {"text": "grasp the bottle", "start": 0.0, "end": 1.0},