# Remote Inference (lerobot-policy-server)

Remote inference decouples GPU policy inference from robot control. A `lerobot-policy-server` process runs the policy on a GPU machine; the robot runs `lerobot-rollout --inference.type=remote` as a **weightless edge client**: no policy weights, no GPU, no policy processors on the robot. One GPU server can serve several robots at once, and the remote backend works with every rollout strategy (`base`, `sentry`, `highlight`, `dagger`, `episodic`).

Use remote inference when:

- The policy is too large or too slow for the machine attached to the robot (e.g. Pi0/Pi0.5 on a Raspberry Pi or laptop edge).
- You want one GPU to serve a fleet of robots running the same policy.
- You want to update or restart the inference side without touching the robots.

Remote inference requires the `async` extra on **both** sides: `pip install 'lerobot[async]'` (installs `eclipse-zenoh` and `msgpack`). The server additionally needs the extras of the policy it serves (e.g. `lerobot[pi]`, `lerobot[smolvla]`).

## Architecture

```
 robot (edge, weightless)                       GPU machine
┌───────────────────────────┐                  ┌────────────────────────────┐
│ lerobot-rollout           │                  │ lerobot-policy-server      │
│ --inference.type=remote   │     zenoh        │ one process = one          │
│                           │     router       │ (model, revision, GPU)     │
│ control loop @ fps        │   ┌────────┐     │                            │
│  └─ pops local action  ◄──┼───┤ zenohd ├─────┼─► inference worker thread  │
│     buffer (chunks)       │   └────────┘     │   (round-robin over        │
│                           │  observations ►  │    client sessions)        │
│ network worker thread  ───┼──►    ◄ action   │                            │
│  (publishes obs, merges   │         chunks   │ stateless per request      │
│   chunks into buffer)     │                  │                            │
└───────────────────────────┘                  └────────────────────────────┘
```

The client keeps a local **action buffer** filled with chunks of future actions, so the control loop never blocks on the network: short network blips are absorbed by the buffer and the robot keeps moving.
The client self-clocks: it requests a new chunk whenever the buffer holds less than `--inference.buffer_time_s` seconds of playback.

The server is **stateless per request**: clients ship their RTC prefixes and a delay hint with every observation, so a server crash or restart loses no control state and reconnects are trivial. In production both robots and servers _dial out_ to a `zenohd` router (NAT-friendly: nothing on the robot network needs an open inbound port).

## Quickstart on a LAN (peer mode, no router)

For a quick test on one network you can skip the router: the server listens directly and the robot connects to it.

On the GPU machine:

```bash
lerobot-policy-server \
  --model.repo_or_path=${HF_USER}/my_pi0_policy \
  --default_task="pick up the cube" \
  --zenoh.mode=peer \
  --zenoh.listen_endpoints='["tcp/0.0.0.0:7447"]'
```

Wait for `Policy server up: ...` (the model is downloaded, loaded, and warmed up first).

On the robot machine (replace `192.168.1.42` with the GPU machine's IP):

```bash
lerobot-rollout \
  --strategy.type=base \
  --policy.path=${HF_USER}/my_pi0_policy \
  --inference.type=remote \
  --inference.zenoh_mode=peer \
  --inference.connect_endpoint=tcp/192.168.1.42:7447 \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --task="pick up the cube" \
  --duration=60
```

`--policy.path` on the client resolves to a config-only download (no weights): it is used for pre-flight validation and action ordering, and doubles as the default service address. The client's `--policy.path` and `--task` must match the server's `--model.repo_or_path` and `--default_task`, because that pair is the namespace the service is published under (see [Troubleshooting](#troubleshooting)).

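Before launching the rollout, it can help to confirm that the robot machine can open a TCP connection to the server's listen endpoint at all. The following is a minimal sketch using only the Python standard library; `parse_endpoint` and `can_reach` are hypothetical helper names, not part of LeRobot or zenoh, and the check only proves that the port accepts connections, not that a policy server is answering on it.

```python
import socket


def parse_endpoint(endpoint):
    """Split an endpoint such as 'tcp/192.168.1.42:7447' into (protocol, host, port)."""
    protocol, slash, address = endpoint.partition("/")
    host, colon, port = address.rpartition(":")
    if not slash or not colon or not host or not port.isdigit():
        raise ValueError(f"expected '<protocol>/<host>:<port>', got {endpoint!r}")
    return protocol, host, int(port)


def can_reach(endpoint, timeout_s=3.0):
    """Return True if a plain TCP connection to the endpoint's host:port succeeds."""
    _, host, port = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


# Example (run on the robot machine, with the address used in the quickstart):
#   can_reach("tcp/192.168.1.42:7447")
```

If this returns `False`, the problem is at the network level (IP address, firewall, or the server not listening yet) rather than in the service namespace.
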
## Production deployment (router)

In production, run a [zenoh router](https://zenoh.io/docs/getting-started/installation/) (`zenohd`) somewhere both sides can reach, and have robots and servers dial out to it:

```bash
zenohd   # listens on tcp/0.0.0.0:7447 by default
```

Configure the server with a YAML manifest:

```yaml
# server.yaml
model:
  repo_or_path: lerobot/pi0_towels
  revision: main
  dtype: bfloat16 # optional cast after load
  device: cuda
default_task: "fold the towel"
serving_mode: auto # shared for verified chunk-stateless policies, exclusive otherwise
max_sessions: 5
warmup_inferences: 2
trained_fps: 30.0
rtc:
  enabled: true
  execution_horizon: 10
  max_guidance_weight: 10.0
health_port: 9100 # /healthz + /metrics; 0 disables
zenoh:
  mode: client
  connect_endpoints: ["tcp/router.gpu-cluster.internal:7447"]
```

```bash
lerobot-policy-server --manifest server.yaml
```

Everything in the manifest can also be set directly on the CLI (`--model.repo_or_path=...`, `--max_sessions=...`, etc.). One process serves exactly one `(model, revision, dtype, device)`. To serve two models, or one model on two GPUs, run two processes. Dynamic model loading is deliberately unsupported: pre-warmed processes keep capacity planning honest.

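Because the server only becomes useful after the model is loaded and warmed up, a deployment script may want to wait for the health endpoint before starting any robots. The sketch below polls `/healthz` on the manifest's `health_port` with the Python standard library. The function name `wait_until_healthy` and the host in the usage comment are hypothetical; only the `/healthz` path, the `200` status, and port `9100` come from this page.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout_s=120.0, poll_interval_s=2.0, request_timeout_s=5.0):
    """Poll `<base_url>/healthz` until it answers 200 or `timeout_s` elapses.

    Returns True on the first 200 response, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/healthz", timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet, or it answered with an error status such as 503.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Example (hypothetical host name for the GPU machine):
#   wait_until_healthy("http://gpu-server.internal:9100")
```

The check talks to the server's own health port, not to the zenoh router, so the host must be the GPU machine itself.
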
On the robot, only the endpoint changes (the default `--inference.zenoh_mode=client` is already router mode):

```bash
lerobot-rollout \
  --strategy.type=base \
  --policy.path=lerobot/pi0_towels \
  --inference.type=remote \
  --inference.connect_endpoint=tcp/router.gpu-cluster.internal:7447 \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --task="fold the towel" \
  --duration=600
```

### TLS / mTLS

For traffic that leaves a trusted network, terminate TLS at the router and give both sides client certificates (all three PEM paths are required together):

```yaml
# server.yaml (zenoh section)
zenoh:
  mode: client
  connect_endpoints: ["tls/router.gpu-cluster.internal:7447"]
  tls_root_ca_certificate: /etc/lerobot/ca.pem
  tls_connect_certificate: /etc/lerobot/server.pem
  tls_connect_private_key: /etc/lerobot/server.key
```

On the robot the equivalent flags are `--inference.tls_ca`, `--inference.tls_cert`, and `--inference.tls_key`, with `--inference.connect_endpoint=tls/...`.

Multicast scouting is always disabled: discovery is configuration, not protocol magic. If nothing connects, check the endpoints; there is no fallback discovery mechanism.

## RTC over the network

The remote engine reuses the [Real-Time Chunking](./rtc) machinery: the client keeps the chunk leftover and latency tracking locally and ships an action prefix plus a delay hint with every observation; the server runs prefix-conditioned chunk generation. This gives the same smooth chunk-to-chunk transitions as local RTC, with network latency folded into the delay computation.

RTC is enabled by default on both sides (`rtc.enabled: true`). Tune it from the client:

```bash
lerobot-rollout \
  ... \
  --inference.type=remote \
  --inference.rtc.execution_horizon=10 \
  --inference.rtc.max_guidance_weight=10.0
```

If the server or its policy does not support RTC (only `pi0`, `pi05`, and `smolvla` are RTC-capable, and the server manifest must have `rtc.enabled: true`), the session is **downgraded to plain chunk-append** and the client logs:

```
RTC downgraded to chunk-append (server does not support RTC)
```

The robot still runs. Chunks are simply appended to the buffer without prefix blending, which can produce visible seams between chunks on slow policies.

## Fail-safe behavior

The client runs a fail-safe state machine (`CONNECTING → STREAMING → DEGRADED → STALLED → RECONNECTING → DEAD`). A bad initial deployment fails fast: `lerobot-rollout` aborts before the robot moves if the handshake or validation fails. Once streaming, faults degrade in stages:

| Condition | Behavior |
| --- | --- |
| Short network blip / late chunk | The robot rides its action buffer; state goes `DEGRADED` after `--inference.degraded_after_s` (default 1.0 s) without a fresh chunk |
| Buffered actions older than `max_action_age_s` | Stale actions are dropped (never executed); default `--inference.max_action_age_s=3.0` |
| Buffer runs dry (`STALLED`) | Fallback per `--inference.fallback`: `hold` (default: robot holds its last commanded position), `repeat_last`, or `zero` |
| Server liveliness lost / repeated request timeouts | `RECONNECTING`: re-handshake with exponential backoff (`reconnect_initial_backoff_s=0.5` doubling up to `reconnect_max_backoff_s=10.0`) |
| Reconnected server runs a different model/revision | Hard refusal (`DEAD`): the client never executes wrong-model chunks |
| Offline longer than `max_offline_s` (default 60 s) | `DEAD`: the engine signals the rollout's shutdown event for a clean stop |

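To get a feel for how the reconnect backoff interacts with `max_offline_s`, the documented schedule can be written out as a short sketch. It assumes plain doubling with a cap and no jitter, and it ignores the time each handshake attempt itself takes; `backoff_schedule` and `attempts_within` are hypothetical names, not LeRobot functions.

```python
def backoff_schedule(attempts, initial_s=0.5, max_s=10.0):
    """Return the wait in seconds before each of the first `attempts` reconnect tries."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))  # cap at reconnect_max_backoff_s
        delay *= 2                        # double after every attempt
    return delays


def attempts_within(budget_s, initial_s=0.5, max_s=10.0):
    """Count how many reconnect tries start before the cumulative wait exceeds `budget_s`."""
    if initial_s <= 0:
        raise ValueError("initial_s must be positive")
    waited, delay, attempts = 0.0, initial_s, 0
    while waited + min(delay, max_s) <= budget_s:
        waited += min(delay, max_s)
        delay *= 2
        attempts += 1
    return attempts


print(backoff_schedule(7))    # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
print(attempts_within(60.0))  # 9
```

Under these assumptions, the defaults allow roughly nine reconnect attempts before the 60 s `max_offline_s` limit moves the client to `DEAD`.
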

`--inference.fallback=zero` is required for velocity-controlled robots: for them "send nothing" means "keep the last velocity", so an explicit zero command is the only safe stop. For position-controlled arms the default `hold` is safe.

Server restarts are equally graceful: on SIGTERM the server drops its liveliness token first (clients ride their buffers through the drain), finishes the in-flight inference, and exits. Clients reconnect when the replacement comes up.

## Serving multiple robots

`max_sessions` caps concurrent clients per server process. A single inference worker thread serializes GPU access and round-robins over sessions with a pending observation; per-client newest-wins mailboxes mean overload degrades into longer cycle times (larger but correct client-side delays), never into queue buildup.

A rough capacity estimate, keeping ~20% headroom:

```
N_robots ≈ 0.8 / (rate × inference_time)
```

where `rate` is each robot's chunk-request rate in Hz (how often the client's buffer dips below `buffer_time_s`) and `inference_time` is the server's seconds per chunk. For example, at 100 ms per chunk and ~2 chunk requests per second per robot: `N ≈ 0.8 / (2 × 0.1) = 4` robots.

The actual serving mode is classified per policy family, never inferred:

- **shared**: verified chunk-stateless policies (`act`, `pi0`, `pi05`, and `smolvla` with `n_obs_steps=1`) serve up to `max_sessions` clients from one policy instance.
- **exclusive**: stateful families (diffusion-family policies, `smolvla` with observation history, and any unverified policy) are forced to `max_sessions=1`. Run one server process per robot for these.

`serving_mode: auto` (the default) resolves this automatically; you may force `exclusive`, but `shared` can never override a stateful classification.

## Observability

With `health_port` set (default 9100), the server exposes:

- `GET /healthz`: `200 ok` while the inference worker is alive, `503` otherwise. Wire this to your orchestrator's liveness probe.
- `GET /metrics`: Prometheus text format: `lerobot_policy_server_requests_total`, `errors_total`, `superseded_total`, `dropped_unknown_client_total`, `sessions_opened_total`, `sessions_closed_total`, `active_sessions`, `server_load`.

Every inference request also emits one structured audit line on the `lerobot.policy_server.audit` logger:

```json
{
  "session_id": "9f2c...",
  "client_uuid": "robot-07",
  "seq_id": 412,
  "episode_id": 3,
  "queue_wait_ms": 1.8,
  "inference_ms": 93.2,
  "superseded": 0,
  "outcome": "ok"
}
```

`(session_id, seq_id)` correlates a server-side audit line with the client's request. Set a stable `--inference.client_uuid` per robot (instead of the default fresh UUID per run) for fleet-wide log correlation, and use `--inference.tags` to forward free-form labels in the handshake.

## Troubleshooting

**`No policy server answered status query at '@lerobot/...'`**

The client found no server under the key it dialed. Either the endpoint is wrong (check `--inference.connect_endpoint`, the router, and firewalls), or the **service namespace** does not match. The namespace is the `(model_id, revision, task)` triple: on the client it comes from `--inference.service_model_id` (default: `--policy.path`), `--inference.service_revision` (default: `main`), and `--inference.service_task` (default: the rollout `--task`); on the server from `model.repo_or_path`, `model.revision`, and `service_name` (default: a slug of `default_task`). A robot task string that differs from the server's `default_task` is the most common cause. Fix the task, or pin the namespace explicitly with `--inference.service_task` on the client / `service_name` in the manifest.

**`Action name/order mismatch between server policy and this robot`**

The hard sync-safety contract: chunk columns map to motors **by order**, so the robot's ordered action keys must exactly equal the policy's `action_feature_names`. This fires when the robot type, motor naming, or rename map differs from the training setup.
Use the same robot type (and rename map) the policy was trained with.

**`RTC requested but this server/policy does not support it — downgrading to chunk-append`**

Informational, not fatal. Enable RTC in the server manifest (`rtc.enabled: true`) and make sure the policy family is RTC-capable (`pi0`, `pi05`, `smolvla`). Otherwise, expect chunk-append behavior (see [RTC over the network](#rtc-over-the-network)).

**`server full: N/N sessions active`**

The session-open was rejected at capacity. Raise `max_sessions` (shared mode only), or point the robot at another server replica. The rejection includes the current load so orchestration can retry elsewhere.
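
Before raising `max_sessions`, it may be worth re-running the capacity estimate from [Serving multiple robots](#serving-multiple-robots) with measured numbers, for example `inference_ms` from the audit log. The sketch below is only that formula in code; `estimate_robot_capacity` is a hypothetical helper, and the 20% headroom is the rough figure used earlier on this page, not a guarantee.

```python
import math


def estimate_robot_capacity(request_rate_hz, inference_time_s, headroom=0.2):
    """Apply N ≈ (1 - headroom) / (rate × inference_time) and return the raw estimate.

    request_rate_hz:  chunk requests per second issued by each robot
    inference_time_s: server seconds needed to produce one chunk
    headroom:         fraction of GPU time deliberately left unused
    """
    if request_rate_hz <= 0 or inference_time_s <= 0:
        raise ValueError("rate and inference time must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    gpu_share_per_robot = request_rate_hz * inference_time_s
    return (1.0 - headroom) / gpu_share_per_robot


# Worked example from the text: 100 ms per chunk, ~2 requests per second per robot.
estimate = estimate_robot_capacity(request_rate_hz=2.0, inference_time_s=0.1)
# The small epsilon guards against floating-point results just below a whole number.
print(math.floor(estimate + 1e-9))  # 4
```

If the estimate is already below the number of robots connecting, adding a server replica is likely a better fix than a higher `max_sessions`.
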