lerobot/docs/source/tools.mdx

# Tools

LeRobot v3.1 supports **tool calls** in policies — assistant messages can
emit structured invocations like `say(text="OK, starting now")` that the
runtime dispatches to a real implementation (TTS, controller, logger, …).

This page covers:

1. Where the tool catalog lives (PR 1).
2. How the annotation pipeline produces tool-call atoms (PR 2).
3. How to add your own tool (PR 3).

## Where tools are declared

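Tools are declared in two layers: a catalog of schemas stored with each
dataset, and runnable implementations in the codebase.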

**The catalog** — a list of OpenAI-style function schemas — lives at
`meta/info.json["tools"]` on each dataset. Example:

```json
{
  "features": { "...": "..." },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "say",
        "description": "Speak a short utterance to the user via the TTS executor.",
        "parameters": {
          "type": "object",
          "properties": {
            "text": { "type": "string", "description": "The verbatim text to speak." }
          },
          "required": ["text"]
        }
      }
    }
  ]
}
```

Read it via the dataset metadata accessor:

```python
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata

meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
tools = meta.tools     # list[dict] — OpenAI tool schemas
```

If the dataset's `info.json` doesn't declare any tools, `meta.tools`
returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
single-entry list with the canonical `say` schema. So unannotated
datasets and chat-template consumers keep working without any
configuration:

```python
prompt_str = tokenizer.apply_chat_template(
    sample["messages"],
    tools=meta.tools,                 # works either way
    add_generation_prompt=False,
    tokenize=False,
)
```
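
The fallback described above can be pictured with a minimal sketch. It is not the library's code: `resolve_tools` is a hypothetical helper, the constants are local stand-ins for the ones in `lerobot.datasets.language`, and treating an empty `tools` list the same as a missing key is an assumption.

```python
from typing import Any

# Local stand-ins for SAY_TOOL_SCHEMA / DEFAULT_TOOLS (the real constants
# live in lerobot.datasets.language); copied from the JSON example above.
SAY_TOOL_SCHEMA: dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance to the user via the TTS executor.",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string", "description": "The verbatim text to speak."}
            },
            "required": ["text"],
        },
    },
}
DEFAULT_TOOLS: list[dict[str, Any]] = [SAY_TOOL_SCHEMA]


def resolve_tools(info: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the catalog declared in an info.json dict, else the default one."""
    declared = info.get("tools")
    if declared:  # assumption: an empty list counts as "nothing declared"
        return list(declared)
    # Return a copy so callers cannot mutate the shared default list.
    return list(DEFAULT_TOOLS)


unannotated = resolve_tools({"features": {}})
custom = resolve_tools({"tools": [{"type": "function", "function": {"name": "wave"}}]})
print([t["function"]["name"] for t in unannotated])  # ['say']
print([t["function"]["name"] for t in custom])       # ['wave']
```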

**The implementations** — runnable Python — live under
`src/lerobot/tools/`, one file per tool. The `say` implementation
arrives in PR 3 and wraps Kyutai's pocket-tts model.

## Per-row tool *invocations*

The catalog above describes *what can be called*. The actual *call* — the
function name plus the argument values — is stored per-row, on the
assistant atoms in `language_events`:

```json
{
  "role": "assistant",
  "content": null,
  "style": null,
  "timestamp": 12.4,
  "camera": null,
  "tool_calls": [
    { "type": "function",
      "function": { "name": "say", "arguments": { "text": "On it." } } }
  ]
}
```
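
To make the atom layout concrete, here is a small sketch that walks a row's atoms and collects the calls. It assumes `language_events` is available as a list of dicts shaped like the atom above; `iter_tool_calls` and the user atom in the sample data are hypothetical and only for illustration.

```python
from collections.abc import Iterator
from typing import Any


def iter_tool_calls(atoms: list[dict[str, Any]]) -> Iterator[tuple[Any, str, dict[str, Any]]]:
    """Yield (timestamp, tool name, arguments) for each call on assistant atoms."""
    for atom in atoms:
        if atom.get("role") != "assistant":
            continue
        # `tool_calls` may be missing or None on atoms that carry only text.
        for call in atom.get("tool_calls") or []:
            fn = call["function"]
            yield atom.get("timestamp"), fn["name"], fn["arguments"]


language_events = [
    {"role": "user", "content": "Please start.", "timestamp": 12.0, "tool_calls": None},
    {
        "role": "assistant",
        "content": None,
        "timestamp": 12.4,
        "tool_calls": [
            {"type": "function", "function": {"name": "say", "arguments": {"text": "On it."}}}
        ],
    },
]

calls = list(iter_tool_calls(language_events))
print(calls)  # [(12.4, 'say', {'text': 'On it.'})]
```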

Recipes splice these into rendered messages via `tool_calls_from`:

```yaml
user_interjection_response:
  bindings:
    speech: "emitted_at(t, role=assistant, tool_name=say)"
  messages:
    - { role: user,      content: "${task}",         stream: high_level }
    - { role: assistant, content: "${current_plan}", stream: high_level,
        target: true, tool_calls_from: speech }
```

The model's training target is one assistant turn that carries both the
plan text *and* the `say` tool call. At inference, the runtime parses
the generated text back into structured `tool_calls` and dispatches to
the matching implementation.
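
The parse-then-dispatch step can be sketched as follows. The wire format is an assumption: the example supposes the chat template wraps each call in `<tool_call>…</tool_call>` tags around a JSON object, as some open chat templates do, and the format LeRobot's runtime actually parses may differ. `parse_tool_calls`, `dispatch`, and `EchoSay` are hypothetical names.

```python
import json
import re
from typing import Any

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(generated: str) -> tuple[str, list[dict[str, Any]]]:
    """Split generated text into plain text and structured tool calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(generated):
        payload = json.loads(match.group(1))
        calls.append(
            {
                "type": "function",
                "function": {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                },
            }
        )
    text = TOOL_CALL_RE.sub("", generated).strip()
    return text, calls


class EchoSay:
    """Stand-in for a registered tool instance; returns instead of speaking."""

    def call(self, arguments: dict[str, Any]) -> str:
        return f"spoke: {arguments['text']}"


def dispatch(calls: list[dict[str, Any]], tools: dict[str, Any]) -> list[str]:
    """Route each call to the instance registered under its name."""
    results = []
    for call in calls:
        fn = call["function"]
        tool = tools.get(fn["name"])
        if tool is None:
            results.append(f"unknown tool: {fn['name']}")
            continue
        results.append(tool.call(fn["arguments"]))
    return results


generated = (
    "Pick up the cube.\n"
    '<tool_call>\n{"name": "say", "arguments": {"text": "On it."}}\n</tool_call>'
)
plan, calls = parse_tool_calls(generated)
results = dispatch(calls, {"say": EchoSay()})
print(plan)     # Pick up the cube.
print(results)  # ['spoke: On it.']
```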

## How to add your own tool

Three steps. Concrete example: a `record_observation` tool the policy
can call to capture an extra observation outside the regular control
loop.

### Step 1 — declare the schema

Add an entry under `meta/info.json["tools"]`. Either edit the file
directly on disk *before* running the annotation pipeline (the pipeline
preserves entries that are already there) or hand it to
`lerobot-annotate` via a config flag (PR 2; the exact CLI lands with the
pipeline change).

```json
{
  "tools": [
    { "type": "function", "function": { "name": "say", "...": "..." } },
    {
      "type": "function",
      "function": {
        "name": "record_observation",
        "description": "Capture a high-resolution still image for the user.",
        "parameters": {
          "type": "object",
          "properties": {
            "label": { "type": "string", "description": "Short label for the saved image." }
          },
          "required": ["label"]
        }
      }
    }
  ]
}
```

The schema follows OpenAI's function-calling convention exactly, so the
chat template can render it natively.
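
Because `parameters` is ordinary JSON Schema, a call's arguments can be checked against the declaration before anything is executed. The sketch below covers only `required` and primitive `type` fields; `check_arguments` is a hypothetical helper, not a LeRobot API, and a full validator would handle far more of the schema vocabulary.

```python
from typing import Any

# Mapping from JSON Schema primitive names to Python types (partial).
_JSON_TYPES: dict[str, Any] = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_arguments(schema: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    params = schema["function"]["parameters"]
    props = params.get("properties", {})
    problems = []
    for name in params.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in props:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = _JSON_TYPES.get(props[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"argument {name} should be {props[name]['type']}")
    return problems


record_observation_schema = {
    "type": "function",
    "function": {
        "name": "record_observation",
        "description": "Capture a high-resolution still image for the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "description": "Short label for the saved image."}
            },
            "required": ["label"],
        },
    },
}

print(check_arguments(record_observation_schema, {"label": "shelf"}))  # []
print(check_arguments(record_observation_schema, {}))
# ['missing required argument: label']
```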

### Step 2 — implement the call

Create `src/lerobot/tools/record_observation.py`:

```python
from typing import Any

from .base import Tool  # noqa: F401  (protocol from PR 3 that this class is written to satisfy)

RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." }   # mirrors the JSON above


class RecordObservationTool:
    name = "record_observation"
    schema = RECORD_OBSERVATION_SCHEMA

    def __init__(self, schema: dict[str, Any] | None = None, output_dir: str = "."):
        # Prefer a schema passed in by the caller; fall back to the module default.
        self.schema = schema or RECORD_OBSERVATION_SCHEMA
        self.output_dir = output_dir

    def call(self, arguments: dict[str, Any]) -> str:
        label = arguments["label"]
        # ... save the latest camera frame to <output_dir>/<label>.png ...
        return f"saved {label}.png"

One file per tool keeps dependencies isolated — `record_observation`
might pull `pillow`, while `say` (PR 3) pulls `pocket-tts`. Users
installing only the tools they need avoid heavy transitive deps.
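
For orientation, here is one plausible shape for the `Tool` protocol that such a class would satisfy. The real definition arrives with PR 3 and may differ; the members below (`name`, `schema`, `call`) are inferred from the example class and are an assumption.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    """Assumed protocol: a named, schema-carrying object with a `call` method."""

    name: str
    schema: dict[str, Any]

    def call(self, arguments: dict[str, Any]) -> str: ...


class RecordObservationTool:
    """Trimmed copy of the Step 2 class; satisfies `Tool` without inheriting."""

    name = "record_observation"
    schema: dict[str, Any] = {"type": "function", "function": {"name": "record_observation"}}

    def __init__(self, output_dir: str = "."):
        self.output_dir = output_dir

    def call(self, arguments: dict[str, Any]) -> str:
        return f"saved {arguments['label']}.png"


tool = RecordObservationTool()
# runtime_checkable only verifies that the members exist, not their types.
print(isinstance(tool, Tool))      # True
print(isinstance(object(), Tool))  # False
print(tool.call({"label": "shelf"}))  # saved shelf.png
```

A structural protocol means a tool file does not need to import a shared base class at runtime, which fits the one-file-per-tool layout.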

### Step 3 — register it

Add to `src/lerobot/tools/registry.py` (PR 3):

```python
from .record_observation import RecordObservationTool

TOOL_REGISTRY["record_observation"] = RecordObservationTool
```

That's it. At runtime `get_tools(meta)` looks up each schema in
`meta.tools`, instantiates the matching registered class, and returns
a name → instance dict the dispatcher can route into.

## Where this fits in the three-PR stack

| Layer | PR | What lands |
|---|---|---|
| Catalog storage in `meta/info.json` + `meta.tools` accessor | PR 1 | This page; `SAY_TOOL_SCHEMA`, `DEFAULT_TOOLS` constants in `lerobot.datasets.language`; `LeRobotDatasetMetadata.tools` property |
| Annotation pipeline writes `tools` to meta after a run; honors anything users pre-populated | PR 2 | `lerobot-annotate` ensures `meta/info.json["tools"]` includes the canonical `say` and merges any user-declared tools |
| Runnable implementations under `src/lerobot/tools/`; runtime dispatcher; `say.py` wired to Kyutai's pocket-tts | PR 3 | One file per tool; `Tool` protocol; `TOOL_REGISTRY`; optional `[tools]` extra in `pyproject.toml` |
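
The merge behaviour in the PR 2 row can be illustrated with a short sketch. `merge_tool_catalogs` is a hypothetical helper, and letting a user-declared entry win when names collide is an assumption about how the pipeline resolves conflicts.

```python
from typing import Any


def merge_tool_catalogs(
    user_declared: list[dict[str, Any]],
    canonical: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Keep every user-declared tool and append canonical ones that are missing."""
    merged = list(user_declared)
    seen = {tool["function"]["name"] for tool in user_declared}
    for tool in canonical:
        name = tool["function"]["name"]
        if name not in seen:  # assumption: user entries take precedence
            merged.append(tool)
            seen.add(name)
    return merged


canonical = [{"type": "function", "function": {"name": "say"}}]
user_declared = [{"type": "function", "function": {"name": "record_observation"}}]

merged = merge_tool_catalogs(user_declared, canonical)
print([t["function"]["name"] for t in merged])  # ['record_observation', 'say']
```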

If you want to use a tool *without* writing an implementation (e.g. for
training-time chat-template formatting only), Step 1 of the previous
section is enough: the model still learns to *generate* the call.
Steps 2 and 3 are only needed to actually *execute* it at inference.