# Remote Inference (lerobot-policy-server)

Remote inference decouples GPU policy inference from robot control. A `lerobot-policy-server` process runs the policy on a GPU machine; the robot runs `lerobot-rollout --inference.type=remote` as a **weightless edge client**: no policy weights, no GPU, and no policy processors on the robot. One GPU server can serve several robots at once, and the remote backend works with every rollout strategy (`base`, `sentry`, `highlight`, `dagger`, `episodic`).

Use remote inference when:

- The policy is too large or too slow for the machine attached to the robot (e.g. Pi0/Pi0.5 on a Raspberry Pi or an edge laptop).
- You want one GPU to serve a fleet of robots running the same policy.
- You want to update or restart the inference side without touching the robots.

<Tip>

Remote inference requires the `async` extra on **both** sides: `pip install 'lerobot[async]'` (installs `eclipse-zenoh` and `msgpack`). The server additionally needs the extras of the policy it serves (e.g. `lerobot[pi]`, `lerobot[smolvla]`).

</Tip>

## Architecture

```
  robot (edge, weightless)                              GPU machine
┌───────────────────────────┐                  ┌────────────────────────────┐
│ lerobot-rollout           │                  │ lerobot-policy-server      │
│  --inference.type=remote  │     zenoh        │  one process = one         │
│                           │     router       │  (model, revision, GPU)    │
│ control loop @ fps        │   ┌────────┐     │                            │
│  └─ pops local action  ◄──┼───┤ zenohd ├─────┼─► inference worker thread  │
│     buffer (chunks)       │   └────────┘     │   (round-robin over        │
│                           │  observations ►  │    client sessions)        │
│ network worker thread  ───┼──►   ◄ action    │                            │
│  (publishes obs, merges   │        chunks    │  stateless per request     │
│   chunks into buffer)     │                  │                            │
└───────────────────────────┘                  └────────────────────────────┘
```

The client keeps a local **action buffer** filled with chunks of future actions, so the control loop never blocks on the network: short network blips are absorbed by the buffer and the robot keeps moving. The client self-clocks: it requests a new chunk whenever the buffer holds less than `--inference.buffer_time_s` seconds of playback.

The server is **stateless per request**: clients ship their RTC prefixes and a delay hint with every observation, so a server crash or restart loses zero control state and reconnects are trivial. In production both robots and servers _dial out_ to a `zenohd` router (NAT-friendly: nothing on the robot network needs an open inbound port).

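
The self-clocking rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the client's actual implementation: `ActionBuffer`, `needs_refill`, and the `fps` argument are hypothetical names, and only `buffer_time_s` corresponds to the `--inference.buffer_time_s` flag. The real client also merges chunks (see RTC) rather than simply appending them.

```python
from collections import deque


class ActionBuffer:
    """Hypothetical local buffer of future actions (illustration only)."""

    def __init__(self, fps: float, buffer_time_s: float) -> None:
        self.fps = fps
        self.buffer_time_s = buffer_time_s
        self._actions: deque = deque()

    def extend(self, chunk) -> None:
        # Append a freshly received chunk of future actions.
        self._actions.extend(chunk)

    def pop(self):
        # Called once per control tick; returns None when the buffer is dry.
        return self._actions.popleft() if self._actions else None

    def playback_seconds(self) -> float:
        # Each buffered action covers 1/fps seconds of playback.
        return len(self._actions) / self.fps

    def needs_refill(self) -> bool:
        # Request a new chunk once less than buffer_time_s of playback remains.
        return self.playback_seconds() < self.buffer_time_s


buf = ActionBuffer(fps=30.0, buffer_time_s=0.5)
buf.extend(range(30))        # one second of actions at 30 fps
print(buf.needs_refill())    # False
for _ in range(20):          # the control loop consumes 20 actions
    buf.pop()
print(buf.needs_refill())    # True: only 10 actions (about 0.33 s) remain
```

Because the check depends only on local buffer state, the control loop never waits on the network; the network worker is the only party that reacts to `needs_refill()`.
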
## Quickstart on a LAN (peer mode, no router)

For a quick test on one network you can skip the router: the server listens directly and the robot connects to it.

On the GPU machine:

```bash
lerobot-policy-server \
  --model.repo_or_path=${HF_USER}/my_pi0_policy \
  --default_task="pick up the cube" \
  --zenoh.mode=peer \
  --zenoh.listen_endpoints='["tcp/0.0.0.0:7447"]'
```

Wait for `Policy server up: ...` (the model is downloaded, loaded, and warmed up first).

On the robot machine (replace `192.168.1.42` with the GPU machine's IP):

```bash
lerobot-rollout \
  --strategy.type=base \
  --policy.path=${HF_USER}/my_pi0_policy \
  --inference.type=remote \
  --inference.zenoh_mode=peer \
  --inference.connect_endpoint=tcp/192.168.1.42:7447 \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --task="pick up the cube" \
  --duration=60
```

`--policy.path` on the client resolves to a config-only download (no weights): it is used for pre-flight validation and action ordering, and doubles as the default service address. The client's `--policy.path` and `--task` must match the server's `--model.repo_or_path` and `--default_task`. Together with the revision, they form the namespace the service is published under (see [Troubleshooting](#troubleshooting)).

## Production deployment (router)

In production, run a [zenoh router](https://zenoh.io/docs/getting-started/installation/) (`zenohd`) somewhere both sides can reach, and have robots and servers dial out to it:

```bash
zenohd   # listens on tcp/0.0.0.0:7447 by default
```

Configure the server with a YAML manifest:

```yaml
# server.yaml
model:
  repo_or_path: lerobot/pi0_towels
  revision: main
  dtype: bfloat16 # optional cast after load
  device: cuda
default_task: "fold the towel"
serving_mode: auto # shared for verified chunk-stateless policies, exclusive otherwise
max_sessions: 5
warmup_inferences: 2
trained_fps: 30.0
rtc:
  enabled: true
  execution_horizon: 10
  max_guidance_weight: 10.0
health_port: 9100 # /healthz + /metrics; 0 disables
zenoh:
  mode: client
  connect_endpoints: ["tcp/router.gpu-cluster.internal:7447"]
```

```bash
lerobot-policy-server --manifest server.yaml
```

Everything in the manifest can also be set directly on the CLI (`--model.repo_or_path=...`, `--max_sessions=...`, etc.). One process serves exactly one `(model, revision, dtype, device)`; to serve two models, or one model on two GPUs, run two processes. Dynamic model loading is deliberately unsupported: pre-warmed processes keep capacity planning honest.

On the robot, only the endpoint changes (the default `--inference.zenoh_mode=client` is already router mode):

```bash
lerobot-rollout \
  --strategy.type=base \
  --policy.path=lerobot/pi0_towels \
  --inference.type=remote \
  --inference.connect_endpoint=tcp/router.gpu-cluster.internal:7447 \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --task="fold the towel" \
  --duration=600
```

### TLS / mTLS

For traffic that leaves a trusted network, terminate TLS at the router and give both sides client certificates (all three PEM paths are required together):

```yaml
# server.yaml (zenoh section)
zenoh:
  mode: client
  connect_endpoints: ["tls/router.gpu-cluster.internal:7447"]
  tls_root_ca_certificate: /etc/lerobot/ca.pem
  tls_connect_certificate: /etc/lerobot/server.pem
  tls_connect_private_key: /etc/lerobot/server.key
```

On the robot the equivalent flags are `--inference.tls_ca`, `--inference.tls_cert`, and `--inference.tls_key`, with `--inference.connect_endpoint=tls/...`.

<Tip>

Multicast scouting is always disabled: discovery is configuration, not protocol magic. If nothing connects, check the endpoints; there is no fallback discovery mechanism.

</Tip>

## RTC over the network

The remote engine reuses the [Real-Time Chunking](./rtc) machinery: the client keeps the chunk leftover and latency tracking locally and ships an action prefix plus a delay hint with every observation; the server runs prefix-conditioned chunk generation. This gives the same smooth chunk-to-chunk transitions as local RTC, with network latency folded into the delay computation.

RTC is enabled by default on both sides (`rtc.enabled: true`). Tune it from the client:

```bash
lerobot-rollout \
  ... \
  --inference.type=remote \
  --inference.rtc.execution_horizon=10 \
  --inference.rtc.max_guidance_weight=10.0
```

If the server or its policy does not support RTC (only `pi0`, `pi05`, and `smolvla` are RTC-capable, and the server manifest must have `rtc.enabled: true`), the session is **downgraded to plain chunk-append** and the client logs:

```
RTC downgraded to chunk-append (server does not support RTC)
```

The robot still runs: chunks are simply appended to the buffer without prefix blending, which can produce visible seams between chunks on slow policies.

## Fail-safe behavior

The client runs a fail-safe state machine (`CONNECTING → STREAMING → DEGRADED → STALLED → RECONNECTING → DEAD`). A bad initial deployment fails fast: `lerobot-rollout` aborts before the robot moves if the handshake or validation fails. Once streaming, faults degrade in stages:

| Condition | Behavior |
| --- | --- |
| Short network blip / late chunk | The robot rides its action buffer; state goes `DEGRADED` after `--inference.degraded_after_s` (default 1.0 s) without a fresh chunk |
| Buffered actions older than `max_action_age_s` | Stale actions are dropped (never executed); default `--inference.max_action_age_s=3.0` |
| Buffer runs dry (`STALLED`) | Fallback per `--inference.fallback`: `hold` (default: robot holds its last commanded position), `repeat_last`, or `zero` |
| Server liveliness lost / repeated request timeouts | `RECONNECTING`: re-handshake with exponential backoff (`reconnect_initial_backoff_s=0.5` doubling up to `reconnect_max_backoff_s=10.0`) |
| Reconnected server runs a different model/revision | Hard refusal (`DEAD`): the client never executes wrong-model chunks |
| Offline longer than `max_offline_s` (default 60 s) | `DEAD`: the engine signals the rollout's shutdown event for a clean stop |

<Tip warning={true}>

`--inference.fallback=zero` is required for velocity-controlled robots: for them "send nothing" means "keep the last velocity", so an explicit zero command is the only safe stop. For position-controlled arms the default `hold` is safe.

</Tip>

Server restarts are equally graceful: on SIGTERM the server drops its liveliness token first (clients ride their buffers through the drain), finishes the in-flight inference, and exits. Clients reconnect when the replacement comes up.

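
The reconnect schedule from the table above (start at 0.5 s, double, cap at 10.0 s) can be sketched as a small generator. This is a minimal illustration, assuming plain doubling with no jitter; the function name `reconnect_backoff` is hypothetical, and its arguments simply mirror the documented `reconnect_initial_backoff_s` and `reconnect_max_backoff_s` defaults.

```python
from itertools import islice
from typing import Iterator


def reconnect_backoff(initial_s: float = 0.5, max_s: float = 10.0) -> Iterator[float]:
    """Yield wait times that double after each failed attempt, capped at max_s."""
    delay = initial_s
    while True:
        yield min(delay, max_s)
        delay = min(delay * 2, max_s)


# First seven waits with the documented defaults.
print(list(islice(reconnect_backoff(), 7)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

With these defaults the waits before the first six attempts sum to 25.5 s, so a client would get several retries in before a 60 s `max_offline_s` budget runs out.
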
## Serving multiple robots

`max_sessions` caps concurrent clients per server process. A single inference worker thread serializes GPU access and round-robins over sessions with a pending observation; per-client newest-wins mailboxes mean overload degrades into longer cycle times (larger but correct client-side delays), never into queue buildup.

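
A newest-wins mailbox, as described above, is a single slot in which a new observation overwrites any unread one. The sketch below is an illustration only: `NewestWinsMailbox` and its `superseded` counter are hypothetical names, and relating the counter to the server's `superseded_total` metric is an assumption.

```python
import threading
from typing import Any, Optional


class NewestWinsMailbox:
    """Single-slot mailbox: a new observation replaces any unread one."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._item: Optional[Any] = None
        self.superseded = 0  # observations overwritten before being read

    def put(self, item: Any) -> None:
        with self._lock:
            if self._item is not None:
                self.superseded += 1
            self._item = item

    def take(self) -> Optional[Any]:
        # Return the pending observation (or None) and empty the slot.
        with self._lock:
            item, self._item = self._item, None
            return item


box = NewestWinsMailbox()
for seq_id in (1, 2, 3):     # three observations arrive while the worker is busy
    box.put({"seq_id": seq_id})
print(box.take())            # {'seq_id': 3}
print(box.superseded)        # 2
print(box.take())            # None
```

Since the slot never holds more than one item per client, a slow GPU yields fewer, fresher inferences instead of a growing backlog of stale observations.
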
A rough capacity estimate, keeping ~20% headroom:

```
N_robots ≈ 0.8 / (rate × inference_time)
```

where `rate` is each robot's chunk-request rate in Hz (how often the client's buffer dips below `buffer_time_s`) and `inference_time` is the server's seconds per chunk. For example, at 100 ms per chunk and ~2 chunk requests per second per robot: `N ≈ 0.8 / (2 × 0.1) = 4` robots.

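
The same estimate as a small helper, for plugging in measured numbers. The function name `estimate_robot_capacity` and the `utilization` parameter are hypothetical, and rounding down to a whole robot is a choice made in this sketch rather than something the formula prescribes.

```python
import math


def estimate_robot_capacity(
    rate_hz: float, inference_time_s: float, utilization: float = 0.8
) -> int:
    """Robots one server can sustain: floor(utilization / (rate * inference_time))."""
    if rate_hz <= 0 or inference_time_s <= 0:
        raise ValueError("rate_hz and inference_time_s must be positive")
    return math.floor(utilization / (rate_hz * inference_time_s))


print(estimate_robot_capacity(rate_hz=2.0, inference_time_s=0.1))   # 4
print(estimate_robot_capacity(rate_hz=2.0, inference_time_s=0.25))  # 1
```
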
The actual serving mode is classified per policy family, never inferred:

- **shared**: verified chunk-stateless policies (`act`, `pi0`, `pi05`, and `smolvla` with `n_obs_steps=1`) serve up to `max_sessions` clients from one policy instance.
- **exclusive**: stateful families (diffusion-family policies, `smolvla` with observation history, and any unverified policy) are forced to `max_sessions=1`. Run one server process per robot for these.

`serving_mode: auto` (the default) resolves this automatically; you may force `exclusive`, but `shared` can never override a stateful classification.

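
The resolution rules above can be written down as a short function. This is a sketch of the documented behavior, not the server's code: `resolve_serving_mode`, `effective_max_sessions`, and the `VERIFIED_STATELESS` set are hypothetical names. The text does not say whether the server rejects a `shared` request for a stateful policy or quietly keeps it exclusive; this sketch assumes the latter.

```python
# Families the text above lists as verified chunk-stateless (with n_obs_steps=1).
VERIFIED_STATELESS = {"act", "pi0", "pi05", "smolvla"}


def resolve_serving_mode(policy_type: str, n_obs_steps: int = 1, requested: str = "auto") -> str:
    """Return 'shared' or 'exclusive' for a policy family and requested mode."""
    if requested not in {"auto", "shared", "exclusive"}:
        raise ValueError(f"unknown serving_mode: {requested!r}")
    stateless = policy_type in VERIFIED_STATELESS and n_obs_steps == 1
    # Forcing exclusive is always allowed; shared never overrides a stateful policy.
    if requested == "exclusive" or not stateless:
        return "exclusive"
    return "shared"


def effective_max_sessions(mode: str, max_sessions: int) -> int:
    # Exclusive mode is forced to a single session per server process.
    return max_sessions if mode == "shared" else 1


print(resolve_serving_mode("pi0"))                            # shared
print(resolve_serving_mode("smolvla", n_obs_steps=2))         # exclusive
print(resolve_serving_mode("diffusion", requested="shared"))  # exclusive
print(effective_max_sessions("exclusive", 5))                 # 1
```
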
## Observability

With `health_port` set (default 9100), the server exposes:

- `GET /healthz`: `200 ok` while the inference worker is alive, `503` otherwise. Wire this to your orchestrator's liveness probe.
- `GET /metrics`: Prometheus text format: `lerobot_policy_server_requests_total`, `errors_total`, `superseded_total`, `dropped_unknown_client_total`, `sessions_opened_total`, `sessions_closed_total`, `active_sessions`, `server_load`.

Every inference request also emits one structured audit line on the `lerobot.policy_server.audit` logger:

```json
{
  "session_id": "9f2c...",
  "client_uuid": "robot-07",
  "seq_id": 412,
  "episode_id": 3,
  "queue_wait_ms": 1.8,
  "inference_ms": 93.2,
  "superseded": 0,
  "outcome": "ok"
}
```

`(session_id, seq_id)` correlates a server-side audit line with the client's request. Set a stable `--inference.client_uuid` per robot (instead of the default fresh UUID per run) for fleet-wide log correlation, and use `--inference.tags` to forward free-form labels in the handshake.

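
As a sketch of that correlation, the snippet below indexes audit lines by `(session_id, seq_id)` so a client-side request can be looked up directly. It assumes each audit line is a bare JSON object on one line; a logging handler that adds a timestamp or level prefix would need that prefix stripped first. The helper name `index_audit_lines` is hypothetical, and the sample record is the one shown above.

```python
import json
from typing import Dict, Iterable, Tuple

AuditKey = Tuple[str, int]


def index_audit_lines(lines: Iterable[str]) -> Dict[AuditKey, dict]:
    """Index JSON audit lines by (session_id, seq_id), skipping anything else."""
    index: Dict[AuditKey, dict] = {}
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not an audit line
        if isinstance(record, dict) and "session_id" in record and "seq_id" in record:
            index[(record["session_id"], record["seq_id"])] = record
    return index


sample = [
    '{"session_id": "9f2c...", "client_uuid": "robot-07", "seq_id": 412, "episode_id": 3, '
    '"queue_wait_ms": 1.8, "inference_ms": 93.2, "superseded": 0, "outcome": "ok"}',
    "some unrelated log line",
]
index = index_audit_lines(sample)
record = index[("9f2c...", 412)]
print(record["client_uuid"], record["outcome"])  # robot-07 ok
```
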
## Troubleshooting

**`No policy server answered status query at '@lerobot/...'`**

The client found no server under the key it dialed. Either the endpoint is wrong (check `--inference.connect_endpoint`, the router, and firewalls), or the **service namespace** does not match. The namespace is the `(model_id, revision, task)` triple: on the client it comes from `--inference.service_model_id` (default: `--policy.path`), `--inference.service_revision` (default: `main`), and `--inference.service_task` (default: the rollout `--task`); on the server from `model.repo_or_path`, `model.revision`, and `service_name` (default: a slug of `default_task`). A robot task string that differs from the server's `default_task` is the most common cause: fix the task, or pin the namespace explicitly with `--inference.service_task` on the client / `service_name` in the manifest.

**`Action name/order mismatch between server policy and this robot`**

The hard sync-safety contract: chunk columns map to motors **by order**, so the robot's ordered action keys must exactly equal the policy's `action_feature_names`. This fires when the robot type, motor naming, or rename map differs from the training setup. Use the same robot type (and rename map) the policy was trained with.

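
The order contract is a strict list comparison, which the following sketch makes explicit. It is not the client's actual check: `check_action_order` is a hypothetical helper and the motor names in the example are made up for illustration. It only shows why two sets of identical names can still fail when their order differs.

```python
from typing import Sequence


def check_action_order(robot_keys: Sequence[str], policy_names: Sequence[str]) -> None:
    """Raise ValueError unless the robot's ordered keys equal the policy's names."""
    if list(robot_keys) == list(policy_names):
        return
    if sorted(robot_keys) == sorted(policy_names):
        detail = "same names, different order"
    else:
        missing = [n for n in policy_names if n not in robot_keys]
        extra = [k for k in robot_keys if k not in policy_names]
        detail = f"missing on robot: {missing}, unexpected on robot: {extra}"
    raise ValueError(f"Action name/order mismatch: {detail}")


policy_names = ["shoulder_pan.pos", "elbow_flex.pos", "gripper.pos"]  # illustrative names
check_action_order(policy_names, policy_names)  # passes silently

try:
    check_action_order(["elbow_flex.pos", "shoulder_pan.pos", "gripper.pos"], policy_names)
except ValueError as err:
    print(err)  # Action name/order mismatch: same names, different order
```
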
**`RTC requested but this server/policy does not support it — downgrading to chunk-append`**

Informational, not fatal. Enable RTC in the server manifest (`rtc.enabled: true`) and make sure the policy family is RTC-capable (`pi0`, `pi05`, `smolvla`). Otherwise, expect chunk-append behavior (see [RTC over the network](#rtc-over-the-network)).

**`server full: N/N sessions active`**

The session-open was rejected at capacity. Raise `max_sessions` (shared mode only), or point the robot at another server replica; the rejection includes the current load so orchestration can retry elsewhere.