feat(language): per-camera tagging on view-dependent styles

Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded in. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  and to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pepijn
2026-04-30 10:48:17 +02:00
parent 0b06790da0
commit 5a6aa64570
8 changed files with 344 additions and 33 deletions
docs/source/language_and_recipes.mdx +37 -3
@@ -6,16 +6,24 @@ The two optional columns are:
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
Both columns share the same row shape (event rows omit `timestamp` because the
frame the row sits on already provides it):
```text
role: string
content: string | null
style: string | null
timestamp: float64 # persistent rows only
camera: string | null # observation.images.* feature key, view-dependent rows only
tool_calls: list[Json] | null
```
The `camera` field tags rows whose `content` is grounded in a specific camera
view. Rows of view-dependent styles (`vqa`, and the reserved `motion` /
`trace`) MUST set `camera` to the matching `observation.images.*` feature key.
Rows of every other style MUST leave `camera` as `null`. Pipeline writers and
the validator enforce this via `validate_camera_field(style, camera)`.
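A minimal sketch of what those helpers could look like, given only the rule above (the actual bodies and error types in `language.py` may differ):
```python
# Sketch, not the shipped implementation: the names come from this commit,
# the bodies are assumed from the rule "camera required iff view-dependent".
VIEW_DEPENDENT_STYLES = {"vqa", "motion", "trace"}


def is_view_dependent_style(style: str | None) -> bool:
    return style in VIEW_DEPENDENT_STYLES


def validate_camera_field(style: str | None, camera: str | None) -> None:
    if is_view_dependent_style(style):
        if camera is None:
            raise ValueError(f"style={style!r} is view-dependent and requires a camera")
        if not camera.startswith("observation.images."):
            # Assumed check; the doc requires an observation.images.* feature key.
            raise ValueError(f"camera={camera!r} is not an observation.images.* feature key")
    elif camera is not None:
        raise ValueError(f"style={style!r} is not view-dependent; camera must be null")
```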
`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
## Architecture
@@ -39,11 +47,37 @@ Persistent styles are active after emission until replaced:
Event styles only exist on their exact timestamp:
- `emitted_at(t, style=interjection)`
- `emitted_at(t, style=vqa, role=user)`
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
- `emitted_at(t, role=assistant, tool_name=say)`
Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
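The selection rule behind these resolvers can be sketched roughly as follows (illustration only; the real `_matching_rows` / `emitted_at` in `language_render.py` also handle `tool_name=`, persistent styles, and the `nth_prev` / `nth_next` variants):
```python
# Illustration of the filter-and-disambiguate rule, assuming `rows` holds the
# language_events of the frame whose timestamp exactly equals t.
def _matching_rows(rows, *, style=None, role=None, camera=None):
    return [
        r for r in rows
        if (style is None or r["style"] == style)
        and (role is None or r["role"] == role)
        and (camera is None or r.get("camera") == camera)
    ]


def emitted_at(rows, *, style=None, role=None, camera=None):
    matches = _matching_rows(rows, style=style, role=role, camera=camera)
    if len(matches) > 1:
        # Two cameras emitting (vqa, assistant) on the same frame land here
        # unless the recipe passes a camera= filter -- intentionally.
        raise ValueError("ambiguous language rows; add role=/camera= filters")
    return matches[0]["content"] if matches else None
```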
## View-dependent resolution
For view-dependent styles (`vqa`, `motion`, `trace`), the resolver gains a
`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
camera at the same timestamp; without `camera=`, those resolvers see two
matches and raise an ambiguity error. Recipes consume each camera through its
own binding plus a matching image block, e.g.
```yaml
ask_vqa_top:
  bindings:
    vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
    vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
  messages:
    - role: user
      stream: high_level
      if_present: vqa_query
      content:
        - { type: image, feature: observation.images.top }
        - { type: text, text: "${vqa_query}" }
    - { role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa }
```
Add one such sub-recipe per camera the dataset records.
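The `pi05_hirobot` recipe, for example, pairs the `ask_vqa_top` block above with an `ask_vqa_wrist` sibling along these lines (sketched here; the `observation.images.wrist` feature key is an assumption, adjust it to whatever the dataset records):
```yaml
# Sibling sub-recipe for the second camera; swap the feature key for your dataset.
ask_vqa_wrist:
  bindings:
    vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
    vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
  messages:
    - role: user
      stream: high_level
      if_present: vqa_query
      content:
        - { type: image, feature: observation.images.wrist }
        - { type: text, text: "${vqa_query}" }
    - { role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa }
```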
## Recipe anatomy
Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`.