diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 5ca449145..42ea06d7d 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -33,6 +33,8 @@
       title: Using the Dataset Tools
     - local: language_and_recipes
       title: Language Columns and Recipes
+    - local: tools
+      title: Tools
     - local: streaming_video_encoding
       title: Streaming Video Encoding
   title: "Datasets"
diff --git a/docs/source/tools.mdx b/docs/source/tools.mdx
new file mode 100644
index 000000000..47d556a2e
--- /dev/null
+++ b/docs/source/tools.mdx
@@ -0,0 +1,198 @@
# Tools

LeRobot v3.1 supports **tool calls** in policies — assistant messages can
emit structured invocations like `say(text="OK, starting now")` that the
runtime dispatches to a real implementation (TTS, controller, logger, …).

This page covers:

1. Where the tool catalog lives (PR 1).
2. How the annotation pipeline produces tool-call atoms (PR 2).
3. How to add your own tool (PR 3).

## Where tools are declared

Tools are declared in two layers.

**The catalog** — a list of OpenAI-style function schemas — lives at
`meta/info.json["tools"]` on each dataset. Example:

```json
{
  "features": { "...": "..." },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "say",
        "description": "Speak a short utterance to the user via the TTS executor.",
        "parameters": {
          "type": "object",
          "properties": {
            "text": { "type": "string", "description": "The verbatim text to speak." }
          },
          "required": ["text"]
        }
      }
    }
  ]
}
```

Read it via the dataset metadata accessor:

```python
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata

meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
tools = meta.tools  # list[dict] — OpenAI tool schemas
```

If the dataset's `info.json` doesn't declare any tools, `meta.tools`
returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
single-entry list with the canonical `say` schema. So unannotated
datasets and chat-template consumers keep working without any extra
configuration:

```python
prompt_str = tokenizer.apply_chat_template(
    sample["messages"],
    tools=meta.tools,  # works either way
    add_generation_prompt=False,
    tokenize=False,
)
```

**The implementations** — runnable Python — live under
`src/lerobot/tools/`, one file per tool. The `say` implementation
arrives in PR 3 and wraps Kyutai's pocket-tts model.

## Per-row tool *invocations*

The catalog above describes *what can be called*. The actual *call* — the
function name plus the argument values — is stored per row, on the
assistant atoms in `language_events`:

```json
{
  "role": "assistant",
  "content": null,
  "style": null,
  "timestamp": 12.4,
  "camera": null,
  "tool_calls": [
    { "type": "function",
      "function": { "name": "say", "arguments": { "text": "On it." } } }
  ]
}
```

Recipes splice these into rendered messages via `tool_calls_from`:

```yaml
user_interjection_response:
  bindings:
    speech: "emitted_at(t, role=assistant, tool_name=say)"
  messages:
    - { role: user, content: "${task}", stream: high_level }
    - { role: assistant, content: "${current_plan}", stream: high_level,
        target: true, tool_calls_from: speech }
```

The model's training target is one assistant turn that carries both the
plan text *and* the `say` tool call. At inference, the runtime parses
the generated text back into structured `tool_calls` and dispatches to
the matching implementation.
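To make that last step concrete, here is a minimal dispatch sketch. The
`Tool` protocol and the `dispatch_tool_calls` helper are illustrative
names for this page, not the runtime API that ships in PR 3:

```python
from typing import Any, Protocol


class Tool(Protocol):
    """Minimal shape a tool implementation is assumed to expose (illustrative)."""

    name: str

    def call(self, arguments: dict[str, Any]) -> str: ...


def dispatch_tool_calls(tool_calls: list[dict], registry: dict[str, Tool]) -> list[str]:
    """Route each parsed tool call to the implementation registered under its name."""
    results = []
    for call in tool_calls:
        fn = call["function"]
        results.append(registry[fn["name"]].call(fn["arguments"]))
    return results
```

Keying the registry by `function.name` is what ties the per-row
invocation format above to the implementations under
`src/lerobot/tools/`.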
## How to add your own tool

Three steps, walked through with a concrete example: a
`record_observation` tool the policy can call to capture an extra
observation outside the regular control loop.

### Step 1 — declare the schema

Add an entry under `meta/info.json["tools"]`. Either edit the file
directly on disk *before* running the annotation pipeline (it'll be
preserved) or hand it to `lerobot-annotate` via a config flag (PR 2 —
the exact CLI lands with the pipeline change).

```json
{
  "tools": [
    { "type": "function", "function": { "name": "say", "...": "..." } },
    {
      "type": "function",
      "function": {
        "name": "record_observation",
        "description": "Capture a high-resolution still image for the user.",
        "parameters": {
          "type": "object",
          "properties": {
            "label": { "type": "string", "description": "Short label for the saved image." }
          },
          "required": ["label"]
        }
      }
    }
  ]
}
```

The schema follows OpenAI's function-calling convention exactly, so the
chat template can render it natively.

### Step 2 — implement the call

Create `src/lerobot/tools/record_observation.py`:

```python
from typing import Any

from .base import Tool

RECORD_OBSERVATION_SCHEMA: dict[str, Any] = {"...": "..."}  # mirrors the JSON above


class RecordObservationTool(Tool):
    name = "record_observation"
    schema = RECORD_OBSERVATION_SCHEMA

    def __init__(self, schema: dict | None = None, output_dir: str = "."):
        # Allow overriding the class-level schema, e.g. for tests.
        self.schema = schema or RECORD_OBSERVATION_SCHEMA
        self.output_dir = output_dir

    def call(self, arguments: dict) -> str:
        label = arguments["label"]
        # ... save the latest camera frame under `self.output_dir`, keyed by `label` ...
        return f"Saved observation '{label}'"
```
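A quick smoke test before wiring the tool into the runtime, assuming
the hypothetical `RecordObservationTool` sketched above:

```python
tool = RecordObservationTool(output_dir="/tmp/observations")

# Arguments arrive exactly as the runtime parsed them from the model
# output, matching the schema's required ["label"] field.
print(tool.call({"label": "before_grasp"}))  # -> Saved observation 'before_grasp'
```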