mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 00:37:10 +00:00

Steven Palma fc019d3902 feat(rollout): remote inference draft

2026-06-12 02:01:41 +02:00

Decoupled VLA Inference & Edge Control v2: Async Network Inference for `lerobot-rollout`

Status: supersedes the v1 proposal in full. v1 was written against the standalone src/lerobot/async_inference/ prototype, before lerobot-rollout existed. This revision re-grounds the design in the current codebase, keeps v1's decisions that survived contact with it (marked KEPT throughout), reverses the ones that didn't, and adds the safety, multi-tenancy, and operations specifications v1 lacked.

1. Executive Summary

This document specifies a production-grade system for decoupling GPU-bound policy inference from high-frequency robot control, targeting power users running hundreds of robots against centralized GPU clusters. The system keeps v1's Model-as-a-Service (MaaS) paradigm and Zenoh transport, but changes the integration architecture fundamentally:

The client is not a standalone CLI. It is --inference.type=remote, a new InferenceEngine backend inside lerobot-rollout (src/lerobot/rollout/inference/). Every rollout strategy (base, sentry, highlight, dagger, episodic) gets network inference for free — including dataset recording, DAgger pause/resume, Rerun visualization, and safe teardown.
The client is weightless. No policy weights, no policy processors on the edge. --policy.path resolves to a config-only PreTrainedConfig (no weight download) used for pre-flight validation and action ordering.
The server is stateless per request. All RTC chunk state (leftover prefixes, latency tracking, delay computation) lives client-side in the existing ActionQueue/LatencyTracker machinery — the client ships prefixes + a delay hint with each observation. A server crash loses zero control state; reconnects and horizontal scaling are trivial.
Multi-tenancy is engineered, not assumed. The real hazards are stateful processor pipelines and episode-scoped policy state — not predict_action_chunk purity (which holds for ACT/Pi0/Pi0.5/SmolVLA but not diffusion). The server uses per-session processor instances, a chunk-stateless allowlist, and an exclusive serving mode for policies that need it.
The legacy module dies. src/lerobot/async_inference/ (~1,900 lines, pickle-over-gRPC, single-client, four confirmed bugs) is deleted in the same PR that lands the new backend. No deprecation cycle: the module is experimental, its CLI undocumented in the main flow, and every config field has a mapped successor (§13.4).

2. Motivation (unchanged from v1) — KEPT

LeRobot's standard control loop runs policy inference and robot I/O in the same process. This breaks down when:

The policy is too large for edge hardware (Pi0-class models need a dedicated GPU).
Multiple robots need the same policy (redundant GPU allocation per robot).
Inference latency exceeds the control deadline (e.g. 150 ms inference on a 33 ms control tick).

Decoupling solves all three: the edge runs a tight CPU loop; a GPU server performs inference for N clients.

What changed since v1: the local version of this decoupling already shipped. RTCInferenceEngine (src/lerobot/rollout/inference/rtc.py) runs inference in a background thread against a thread-safe ActionQueue with latency-aware chunk merging. The network system is that same architecture with the thread boundary replaced by a network boundary. This is the design's central simplification: reuse, don't reinvent.

3. Gap Analysis: v1 Proposal vs. Modern Codebase

| Topic | v1 assumed | Modern reality | Verdict |
| --- | --- | --- | --- |
| Client architecture | Standalone robot-client CLI (§5.1 of v1) | `InferenceEngine` ABC seam in `lerobot-rollout` (`rollout/inference/base.py`); strategies are backend-agnostic | Superseded: backend, not CLI |
| Chunk blending | Configurable aggregation zoo (`weighted_average`, …) | `ActionQueue` replace-with-delay-trim (RTC) / append (non-RTC) (`policies/rtc/action_queue.py:147-217`) | Superseded: drop blending entirely |
| Latency compensation | Hand-rolled RTT trim (`expired_steps = int(rtt/dt)`, v1 §8.2) | `ActionQueue.merge(..., real_delay, idx_before)` + `LatencyTracker` already do this, validated | Superseded |
| Multi-tenancy invariant | "`predict_action_chunk()` pure ⇒ safe to share" | Processor state + episode-scoped policy state are the real hazards (§7) | Incomplete: fixed in §8.3 |
| Data logging | Client-side `build_dataset_frame` + `add_frame` sketch (v1 §14) | Recording strategies (sentry/episodic/dagger) already log obs + executed actions | Superseded: free via rollout |
| MaaS pre-warm, no dynamic loading | ✓ | Still right; legacy `SendPolicyInstructions` is a pickle/RCE + capacity-planning disaster | KEPT |
| JPEG observation compression | ✓ | Still right (§10.1) | KEPT |
| Status/capability validation before start | ✓ (Zenoh queryable) | Still right; extended into a hard sync-safety contract (§8.4) | KEPT, extended |
| Time-based send threshold (v1 G14) | ✓ | Adopted as `buffer_time_s` | KEPT |
| Zenoh pub/sub data plane | ✓ | Confirmed; QoS corrected (§6.3), control plane moved to queryables, liveliness added | KEPT, hardened |
| MessagePack serialization | ✓ | Endorsed (zenoh's `ext` serializer cannot encode numpy); must be version-gated (§10.4) | KEPT, with schema discipline |
| QoS table (v1 §6.2) | "obs best-effort, actions reliable" | Conflates transport reliability with congestion control; BLOCK on actions is dangerous | Revised (§6.3) |
| Bugs BUG-1…BUG-4, gaps G1…G14 | Listed as work items | Every one resolved structurally by this design (§13.5 mapping) | Resolved by design |

4. Critical Pushbacks on v1

Each pushback: claim → evidence → consequence for this design.

P1 — A standalone client duplicates lerobot-rollout. v1 §5.1 assigns the client: observation capture, action execution at frequency, fail-safe, data logging. Every one of those is already owned by rollout strategies and send_next_action (rollout/strategies/core.py:269-304), which tolerates None actions, runs the interpolator, and routes through the canonical robot processors. A standalone client re-implements loop timing, recording, DAgger UX, Rerun, and teardown safety — and then drifts. Consequence: the client is RemoteInferenceEngine, registered as --inference.type=remote next to sync and rtc.

P2 — The aggregation-function zoo fabricates actions no policy predicted. 0.3*old + 0.7*new produces hybrid actions that exist in no policy's output distribution; the logged action becomes unexplainable (bad for the reproducibility story) and the implementation hosted a real lock-release race (BUG-2, async_inference/robot_client.py:236-267). RTC's prefix-conditioned chunk generation is the principled mechanism for smooth chunk transitions; plain append covers non-RTC chunking. Consequence: ActionQueue replace/append are the only two merge semantics. The zoo is deleted.

P3 — "predict_action_chunk pure ⇒ multi-tenant safe" is incomplete. Verified in-tree: (a) RelativeActionsProcessorStep caches _last_state at preprocess (processor/relative_action_processor.py:131) and the postprocessor reads it back (:189) — a shared pipeline across clients is a race; (b) DiffusionPolicy.predict_action_chunk reads self._queues, which only select_action populates (policies/diffusion/modeling_diffusion.py:90-108) — it is not chunk-stateless; (c) SAC/SARM have no predict_action_chunk at all. Consequence: per-session processor instances (mandatory), a chunk-stateless allowlist, serving_mode: exclusive for diffusion-family, refusal at startup for SAC/SARM, and policy.reset() is never called in shared mode (§8.3).

P4 — v1 re-derives latency compensation that already exists, on top of broken clocks. v1 §8 specifies an in-flight RTT dict and manual stale-step trimming. ActionQueue.merge(original, processed, real_delay, idx_before) already trims real_delay stale steps and cross-validates against actions consumed in flight (action_queue.py:219-246). Worse, the legacy code compares wall clocks across machines (robot_client.py:420 stamps time.time() "to compare timestamps across client and server"; policy_server.py:178 compares it) — NTP skew is the same order as the latencies being measured. Consequence: the monotonic iron rule (§11): instants never cross machines; client timestamps are opaque echoed tokens; servers report only durations. delay_steps = ceil((rtt + inference)/dt) is computed client-side from client-local perf_counter samples and shipped per request.

P5 — One-in-flight per client is a correctness requirement, not a tuning choice. At send time the client snapshots idx_before = queue.get_action_index() and the leftover prefixes; merge validates against them. Two in-flight requests carry conflicting snapshots — the second merge corrupts both RTC replace mode and append mode. The local RTC thread is also strictly one-inference-at-a-time; one-in-flight preserves exact parity. Consequence: the worker publishes one observation, waits for its chunk (or timeout), then sends the next. v1 §8.1's out-of-order in-flight dict is dead weight; a late chunk is accepted only if it answers the latest outstanding seq_id, otherwise dropped.
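The one-in-flight rule from P5 and the client-local latency measurement from P4 can be sketched together. This is a minimal illustration, not the engine's actual code: the class name `InFlightTracker`, its methods, and the injectable `clock` parameter are hypothetical. Only `seq_id`, the use of `perf_counter`, and the `ceil(latency/dt)` rule come from the text.

```python
import math
import time
from typing import Callable, Optional


class InFlightTracker:
    """Hypothetical helper: tracks the single outstanding request of one client."""

    def __init__(self, dt: float, clock: Callable[[], float] = time.perf_counter):
        self.dt = dt          # control period in seconds
        self._clock = clock   # client-local monotonic clock; never sent as an instant
        self._next_seq = 0
        self._outstanding: Optional[tuple] = None  # (seq_id, t_send)

    def send(self) -> int:
        """Register a new request; it becomes the latest outstanding seq_id."""
        seq_id = self._next_seq
        self._next_seq += 1
        self._outstanding = (seq_id, self._clock())
        return seq_id

    def receive(self, seq_id: int) -> Optional[int]:
        """Return the measured delay in control steps, or None when the chunk
        does not answer the latest outstanding request and must be dropped."""
        if self._outstanding is None or seq_id != self._outstanding[0]:
            return None
        _, t_send = self._outstanding
        self._outstanding = None
        latency_s = self._clock() - t_send
        # Small epsilon so an exact multiple of dt is not rounded up by float noise.
        return math.ceil(latency_s / self.dt - 1e-9)
```

With `dt = 1/30` s, a 180 ms cycle yields 6 delay steps, and a second chunk carrying the same `seq_id` is rejected because nothing is outstanding any more.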

P6 — v1's QoS table conflates transport reliability with congestion behavior. "Reliable delivery for actions" sounds right but the dangerous knob is congestion control: a publisher configured BLOCK on the action topic can stall the server's publish path on one robot's dead uplink (Zenoh blocks up to wait_before_close, then may close the transport). A dropped action chunk is recoverable by design — the client's queue keeps the robot moving and the next chunk replaces it. Consequence (§6.3): actions = reliability=RELIABLE (hop-level) + congestion_control=DROP + express=True + priority=INTERACTIVE_HIGH; observations = DROP + DATA. If WAN loss proves material, upgrade the action topic to Zenoh Advanced Pub/Sub (cache + recovery, zenoh ≥ 1.5) rather than BLOCK.

P7 — Schema-less MessagePack invites silent version drift across a 300-robot fleet. msgpack stays (zenoh's ext serializer cannot encode numpy/dataclasses, and the team's choice stands), but naked msgpack dicts across heterogeneous fleet versions fail at runtime, on the robot. Consequence (§10.4): a packed little-endian attachment header (schema_version, seq_id, episode_id, client_mono_ns — the rmw_zenoh pattern) so routing/correlation never deserializes the body; schema_version negotiated at the session handshake; additive-only evolution; golden codec tests. Protobuf-over-ZBytes is the documented fallback if drift bites in practice.
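A packed little-endian header of the kind P7 describes could look like the sketch below. The field names come from the text; the field widths (u16, u64, u32, u64), the `Header` class, and its method names are assumptions made for illustration, since the actual layout is specified in §10.4.

```python
import struct
from dataclasses import dataclass

# Assumed layout, little-endian, no padding:
#   u16 schema_version | u64 seq_id | u32 episode_id | u64 client_mono_ns
_HEADER = struct.Struct("<HQIQ")


@dataclass(frozen=True)
class Header:
    schema_version: int
    seq_id: int
    episode_id: int
    client_mono_ns: int  # opaque token echoed by the server, never compared to its clock

    def pack(self) -> bytes:
        return _HEADER.pack(
            self.schema_version, self.seq_id, self.episode_id, self.client_mono_ns
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "Header":
        if len(raw) != _HEADER.size:
            raise ValueError(f"expected {_HEADER.size} header bytes, got {len(raw)}")
        return cls(*_HEADER.unpack(raw))
```

Because the header has a fixed size, a router-side tool or the server's mailbox can correlate requests by `seq_id` without touching the msgpack body.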

P8 — "Deterministic rollout reproducibility" is unattainable on real robots. No seed controls hardware, sensor noise, or network jitter; RTC's latency-driven trimming is inherently timing-dependent. Consequence: the contract is fully logged + replayable (§12): recording strategies already persist observations and executed actions; the remote engine adds (session_id, seq_id, episode_id) provenance so client datasets join server audit logs mechanically.

P9 — v1 has no safety specification. "Log a warning when the buffer empties" is not a fail-safe for a 300-robot fleet. Consequence (§9): a staleness bound (max_action_age_s — never execute an action older than X relative to its source observation), an explicit fallback ladder (hold / repeat_last / zero — zero-command required for future velocity-controlled robots), and a DEAD state that triggers the existing strategy shutdown path (return-to-initial-pose, disconnect) via the same shutdown_event mechanism RTC uses (rtc.py:359-360).

P10 — Capacity must be formula-driven, not "a user decision". v1 §4 says clients-per-server "is a user decision". With t = server time per request, r = per-client request rate, H = RTC execution horizon, dt = control period: N_max = min( 0.8 / (r·t), (H·dt/2 − RTT_net) / t ) → ACT @ 20 ms, 1 Hz: ~40 clients/GPU. Pi0 @ 150 ms, 1 Hz: ~5 clients/GPU. 300 robots on Pi0 ≈ 60 GPU pods. Consequence: the manifest carries max_sessions; the server rejects session opens beyond it (with current load in the reply) so clients retry another replica. Micro-batching is deferred — blocked on a real API issue (predict_action_chunk takes a scalar inference_delay; batched clients have different delays) — behind a Scheduler seam so it can land later without redesign (§8.5).
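The capacity bound in P10 can be evaluated directly. The sketch below uses the formula and the 0.8 utilization factor from the text; the function name `max_clients` and the example values for H, dt, and RTT_net used afterwards are illustrative assumptions, not measured figures.

```python
import math


def max_clients(t: float, r: float, horizon_steps: int, dt: float, rtt_net: float) -> int:
    """N_max = min(0.8 / (r*t), (H*dt/2 - RTT_net) / t), floored to whole clients.

    t: server time per request (s), r: per-client request rate (Hz),
    horizon_steps: RTC execution horizon H, dt: control period (s),
    rtt_net: network round-trip time (s).
    """
    utilization_bound = 0.8 / (r * t)
    deadline_bound = (horizon_steps * dt / 2 - rtt_net) / t
    bound = min(utilization_bound, deadline_bound)
    # Epsilon guards against float noise just below an integer; never negative.
    return max(0, math.floor(bound + 1e-9))
```

Assuming H = 60 steps at 30 fps and a 50 ms network RTT, `max_clients(0.020, 1.0, 60, 1/30, 0.05)` gives 40 and `max_clients(0.150, 1.0, 60, 1/30, 0.05)` gives 5, matching the ACT and Pi0 figures above. In both cases the utilization term is the binding one; with a much larger RTT the deadline term takes over.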

P11 — Discovery ≠ multicast. Zenoh's multicast scouting does not cross WAN, NAT, or most k8s CNIs. Consequence: multicast scouting disabled; clients use static connect.endpoints (DNS name of the router) + gossip; presence and liveness come from Zenoh liveliness tokens (§6.4), not discovery. "Discovery" for a robot fleet is configuration.

5. System Topology

(Diagram unchanged from v1 — the topology survives; transport/QoS/session details in it are superseded by §6.)

Router tier: one or more zenohd routers (k8s Deployment + Service, TLS on 7447). Robots dial out to the router (NAT-friendly: labs only need outbound 7447/443). GPU servers join as peers via cluster DNS.
Server: one process = one (model_repo, revision, dtype, device) on one GPU, pre-warmed from a YAML manifest (KEPT from v1, amended: pin_task: bool — VLA prompts may vary per session unless pinned).
Client: one robot running lerobot-rollout --inference.type=remote. Weightless: config-only policy metadata.
Identity: client_uuid per robot; session_id per connection epoch; both in every log line on both sides.

6. Zenoh Design

All Zenoh claims below were verified against zenoh / zenoh-python 1.x (eclipse-zenoh 1.9.0). Pin: eclipse-zenoh>=1.9,<2.0; keep zenohd on the same minor as the Python binding. Wheels cover manylinux x86_64/aarch64/armv7l/armv6l + macOS — Raspberry Pi edge clients are covered.

6.1 Key-expression schema

@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/obs       client → server
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/action    server → client
@lerobot/<model_id>/<revision>/<task_slug>/status                  queryable (capabilities)
@lerobot/<model_id>/<revision>/<task_slug>/session                 queryable (open/validate)
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/reset     queryable (episode boundary)
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/alive     liveliness token (client)
@lerobot/<model_id>/<revision>/<task_slug>/server/alive            liveliness token (server)

Rules (hard, enforced by a sanitize_keyexpr() helper):

Root at the verbatim chunk @lerobot — verbatim chunks are only matched by identical chunks, so third-party ** subscribers on a shared router can never scrape the tree.
Sanitize every user-supplied segment (model ids, task strings, uuids): non-empty, no * $ ? # /, no leading/trailing/double /. A task string containing / must be slugified before it becomes a key chunk.
Server subscribes with a single-depth wildcard (.../*/obs) — never ** (it would also match status, alive, …).
v1's cluster/experiment prefix segments are dropped from the key schema; they return as free-form tags metadata in the session handshake (telemetry/labeling, not routing). Routing topology belongs to deployment (which router you dial), not to key depth.
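The sanitization rules above can be sketched as follows. `sanitize_keyexpr` is the helper name used in this document, but the body shown here is only an assumed implementation; `slugify_task` and `obs_key` are hypothetical names introduced for the example.

```python
import re

_FORBIDDEN = frozenset("*$?#/")


def sanitize_keyexpr(segment: str) -> str:
    """Validate one user-supplied key chunk: non-empty, none of * $ ? # /."""
    if not segment:
        raise ValueError("key-expression segment must be non-empty")
    bad = _FORBIDDEN.intersection(segment)
    if bad:
        raise ValueError(f"forbidden characters {sorted(bad)} in segment {segment!r}")
    return segment


def slugify_task(task: str) -> str:
    """Turn a free-form task string into a single safe chunk (assumed slug rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-")
    return sanitize_keyexpr(slug)


def obs_key(model_id: str, revision: str, task: str, client_uuid: str) -> str:
    """Build the client-to-server observation key from validated chunks."""
    chunks = [
        sanitize_keyexpr(model_id),
        sanitize_keyexpr(revision),
        slugify_task(task),
        sanitize_keyexpr(client_uuid),
    ]
    return "/".join(["@lerobot", *chunks, "obs"])
```

Rejecting `/` inside a single chunk also covers the leading, trailing, and doubled slash cases, because the only slashes in the final key are the ones `obs_key` inserts itself.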

6.2 Data plane vs. control plane (the rmw_zenoh split)

Data plane = pub/sub (KEPT from v1): observations up, action chunks down, correlated by seq_id in attachments (§10.4). Pub/sub rather than query-per-inference because: a timed-out query's late reply is dropped by the transport (wasted inference), whereas a late pub/sub chunk is still mergeable if it answers the latest outstanding seq; and pub/sub leaves room for server-initiated messages (drain notices). The one-in-flight discipline (P5) is enforced in the client worker, not by the transport.
Control plane = queryables (request/reply with explicit timeouts; the pattern rmw_zenoh uses for ROS 2 services): status (pre-flight capability fetch, 2 s timeout), session (open/validate → ack with capabilities + session_id), reset (episode boundary; acknowledged, so episodic strategies know the server-side episode state is clean). Always pass an explicit timeout to session.get(): the config default is 10 s, far too long for our watchdogs.
Episode ordering: under one-in-flight there is no obs/reset race window in the data plane, but as belt-and-braces the first observation of each episode also carries episode_start=True + the new episode_id in its header.

6.3 QoS (revised from v1 §6.2 — see P6)

| Topic | reliability | congestion_control | express | priority | Why |
| --- | --- | --- | --- | --- | --- |
| `obs` | default | DROP | false | DATA | Intentional drop already happened at the client's one-slot holder; if the uplink stalls, dropping a frame protects the control loop. |
| `action` | RELIABLE | DROP (never BLOCK) | true | INTERACTIVE_HIGH | Hop-level reliability over TCP; express skips batching for the small (4–50 KB) latency-critical payload; DROP so one dead robot uplink can never stall the server's publish path. Chunk loss is recoverable: the client buffer rides through it. |
| control queryables | RELIABLE | default | n/a | n/a | Correctness over latency; explicit timeouts bound them. |

Upgrade path if WAN chunk loss proves material: AdvancedPublisher/AdvancedSubscriber (zenoh ≥ 1.5) with a small cache + heartbeat-based recovery on the action topic only. Hop-by-hop RELIABLE is not end-to-end reliability — Zenoh has no broker persistence; a disconnected subscriber's data is gone. The design assumes this (client state machine, §9).

6.4 Liveliness (presence + watchdogs)

Client declares a liveliness token on .../<client_uuid>/alive. The server liveliness-subscribes with history=True: token appear → ensure session state; token drop → GC the session (mailbox, processor instances) after a grace period.
Server declares .../server/alive. The client liveliness-subscribes: on drop → treat as RECONNECTING (§9), hold/fallback per config, re-run the status/session handshake when the token reappears.
Tune the transport lease down from its default so ungraceful-death detection takes seconds, not tens of seconds (the config key is transport/link/tx/lease; verify its default in the pinned version).
Liveliness cannot detect a hung-but-connected server. The client's per-request timeout (request_timeout_s) is the authoritative watchdog — this is the structural fix for legacy BUG-3 (no deadlines on GetActions).

6.5 Threading constraints (zenoh-python facts that shape both processes)

No asyncio API in zenoh-python — both client and server are thread-based. This matches the existing RTC engine pattern exactly.
Each callback-based subscriber spawns a dedicated Python thread; blocking Zenoh calls inside callbacks are disallowed. Callbacks must be deposit-only (write a slot, set an event, return).
Channel handlers (FifoChannel, RingChannel) are Rust-side; try_recv() polls without spawning Python threads. RingChannel(1) is native latest-only semantics.
No zero-copy path for our payloads (SHM API is @_unstable and same-host-only; ZBytes copy behavior undocumented). At ~200 KB × a few Hz per robot, one memcpy is irrelevant.

6.6 Router deployment

zenohd official image as a k8s Deployment (1–N replicas; routers mesh and reroute around failures) behind a LoadBalancer/NodePort Service exposing TLS 7447. No official Helm chart exists — roll-your-own manifests.
scouting.multicast.enabled: false; scouting.gossip.enabled: true; clients/servers use static connect.endpoints.
Auth: mTLS per robot (transport.link.tls with enable_mtls) + router ACL keyed on cert_common_names: a robot's cert may only put to @lerobot/**/<its-uuid>/obs and receive on .../<its-uuid>/action. Caveat (flagged): ACL config reloads require a router restart — plan cert/ACL changes as rolling router restarts.
Security review input: the third-party Zenoh protocol security analysis (Census Labs, 2025) should be read before exposing 7447 publicly.

7. The Statelessness Boundary (the load-bearing section)

Where the network cut goes. The local RTC pipeline is:

obs (robot-processed dict)
  → build_dataset_frame(hw_features, obs, "observation")        CLIENT  (cheap, hardware-coupled)
─────────────────────────── network ───────────────────────────
  → prepare_observation_for_inference(...)                      SERVER  (policy-coupled, heavy)
  → per-session preprocessor(...)                               SERVER  (stateful within the request)
  → policy.predict_action_chunk(obs, inference_delay, prefix)   SERVER  (pure for allowlisted policies)
  → per-session postprocessor(...)                              SERVER  (reads state cached at preprocess)
─────────────────────────── network ───────────────────────────
  → ActionQueue.merge(original, processed, real_delay, idx_before)   CLIENT

Three consequences:

The server needs no cross-request state. RelativeActionsProcessorStep writes _last_state at preprocess and the postprocessor reads it back within the same request. Per-session pipeline instances + one-request-at-a-time-per-session give correctness with zero persistent state.
RTC state stays client-side, exactly where RTCInferenceEngine already keeps it. Each request ships: inference_delay_steps = ceil(L_max/dt) (from the client LatencyTracker, whose samples are full network-inclusive cycle times, so RTT compensation falls out for free), prefix_model = queue.get_left_over()[:H], and prefix_robot = queue.get_processed_left_over()[:H] (needed for server-side relative-prefix re-anchoring, mirroring rtc.py:287-305). The response returns both the model-space and robot-space chunks because merge needs both. Each prefix is at most execution_horizon × action_dim float32 values, a few hundred bytes.
G9 dies structurally. No bespoke client resize (F.interpolate in legacy helpers.py), no client-side normalization. Clients ship native camera resolution; the server's canonical processor path does everything — serve-time preprocessing is byte-identical to train-time.

What the server does hold (and what it means):

Per-session processor instances (cheap; normalization stat tensors shared read-only).
Per-session episode counter + stats. Episode reset = reset the session's pipelines, clear its mailbox. policy.reset() is never called in shared mode — it is global to the shared policy instance and unnecessary for chunk-pure policies (ACT's ensembler and Pi0/SmolVLA's queues live in select_action, not predict_action_chunk — verified).
Policies that are not chunk-pure get serving_mode: exclusive (§8.3).

8. The Inference Server: `lerobot-policy-server`

New package src/lerobot/policy_server/; console script lerobot-policy-server --manifest manifest.yaml.

8.1 Process model — KEPT from v1, amended

One process = one model+task on one GPU, loaded and warmed at startup (warmup_inferences dummy forwards; covers torch.compile). Multi-GPU nodes run N processes (CUDA_VISIBLE_DEVICES pinning). Dynamic model loading (SendPolicyInstructions) is rejected: pickle/RCE surface, arbitrary-download surface, and it destroys capacity planning. Amendment: pin_task: false (default) lets VLA clients set the task per session; pin_task: true rejects mismatched tasks at session open.

8.2 Concurrency (pure threads — no asyncio in zenoh-python)

zenoh subscriber (.../*/obs)          inference worker (1 thread, owns GPU)
  deposit-only callback:                loop:
  slots[client_uuid] = sample   ──►       pick next session with pending obs (RR ring)
  (per-client latest-only)                decode JPEG → per-session preprocess
                                          predict_action_chunk(delay, prefix)
control queryables (status/session/      per-session postprocess → encode
  reset): validate, mutate session        publisher.put(.../<uuid>/action)
  registry, reply                       (publishing from the worker thread is fine)

Per-client latest-only mailbox: a wildcard subscriber with a deposit-only callback writing per-client slots (scales to dynamic fleets), or — when the manifest enumerates clients — one RingChannel(1) subscriber per client polled via try_recv(). Either way: newest observation wins; a superseded request is counted (superseded_seqs in the next response) so drops are visible. This deletes legacy BUG-4 (observations_similar + must_go) by construction — the client decides when to request; the server never second-guesses observation content.
Single inference worker: torch releases the GIL inside forward, callbacks stay responsive. Strict round-robin over sessions with pending observations: each gets exactly one inference per cycle; starvation is structurally impossible. Overload degrades into longer cycle times → larger (but correct) client delay_steps → eventually the client staleness bound trips and the robot holds — safe by construction.
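The per-client latest-only mailbox and the round-robin pick can be sketched in plain Python. This is an assumed illustration of the behavior described above, not the server's code: the class `LatestOnlyMailbox` and its methods `deposit` and `take_next` are hypothetical names, and the payload is treated as opaque bytes.

```python
import threading
from collections import deque
from typing import Optional


class LatestOnlyMailbox:
    """One slot per client; the newest observation wins, drops are counted."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._slots: dict = {}        # client_uuid -> (seq_id, payload)
        self._superseded: dict = {}   # client_uuid -> overwritten request count
        self._ring: deque = deque()   # round-robin order of known clients

    def deposit(self, client_uuid: str, seq_id: int, payload: bytes) -> None:
        """Deposit-only: safe to call from a subscriber callback."""
        with self._lock:
            if client_uuid in self._slots:
                # An unprocessed observation is being overwritten.
                self._superseded[client_uuid] = self._superseded.get(client_uuid, 0) + 1
            elif client_uuid not in self._ring:
                self._ring.append(client_uuid)
            self._slots[client_uuid] = (seq_id, payload)

    def take_next(self) -> Optional[tuple]:
        """Return (client_uuid, seq_id, payload, superseded_count) for the next
        client in ring order that has a pending observation, else None."""
        with self._lock:
            for _ in range(len(self._ring)):
                client_uuid = self._ring[0]
                self._ring.rotate(-1)  # the client just visited moves to the back
                if client_uuid in self._slots:
                    seq_id, payload = self._slots.pop(client_uuid)
                    return client_uuid, seq_id, payload, self._superseded.pop(client_uuid, 0)
            return None
```

Because the ring rotates on every visit, a client that floods the server with observations still gets at most one inference per cycle, and the superseded count gives the worker what it needs to report drops back to the client.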

8.3 Chunk-stateless allowlist and serving modes

At startup the server classifies the loaded policy:

| Class | Policies (verified) | Mode |
| --- | --- | --- |
| chunk-stateless | ACT, Pi0, Pi0.5, SmolVLA (and any policy whose `predict_action_chunk` touches no instance state) | `shared`: N sessions, per-session pipelines, `policy.reset()` never called |
| chunk-stateful | Diffusion family (`predict_action_chunk` reads `select_action`-fed `self._queues`) | `exclusive`: `max_sessions=1` enforced; episode reset additionally calls `policy.reset()`; second session open → rejected with a self-explanatory error |
| no chunk API | SAC, SARM | refused at startup |

Implemented as a registry in policy_server/validation.py; the cleaner follow-up is a supports_stateless_chunking class attribute on PreTrainedPolicy (needs a pass over policy families — roadmap §14).
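A registry of this kind might look like the sketch below. The three classes and their modes come from the table above; the policy type strings (`"act"`, `"pi05"`, and so on), the function names, and the decision to refuse unclassified policy types are assumptions made for illustration.

```python
from enum import Enum


class ServingMode(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


# Assumed policy type identifiers; the real registry may key on different strings.
_CHUNK_STATELESS = frozenset({"act", "pi0", "pi05", "smolvla"})
_CHUNK_STATEFUL = frozenset({"diffusion"})
_NO_CHUNK_API = frozenset({"sac", "sarm"})


def classify(policy_type: str) -> ServingMode:
    """Map a policy type to its serving mode, or refuse at startup."""
    if policy_type in _CHUNK_STATELESS:
        return ServingMode.SHARED
    if policy_type in _CHUNK_STATEFUL:
        return ServingMode.EXCLUSIVE
    if policy_type in _NO_CHUNK_API:
        raise ValueError(f"{policy_type}: no predict_action_chunk, refusing to serve")
    # Assumption: anything not explicitly classified is refused rather than guessed.
    raise ValueError(f"{policy_type}: not in the chunk-stateless allowlist")


def effective_max_sessions(mode: ServingMode, manifest_max_sessions: int) -> int:
    """Exclusive mode forces a single session regardless of the manifest."""
    return 1 if mode is ServingMode.EXCLUSIVE else manifest_max_sessions
```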

8.4 Session open & capability validation (fail fast, fail loud)

session queryable payload: client_uuid, policy_type, fps, feature summary (post-rename observation feature names + shapes, ordered action keys), schema_version, RTC intent, tags. Checks:

| Check | Rule | On mismatch |
| --- | --- | --- |
| Action names and order | must equal server's `action_feature_names` exactly | hard reject: this is the sync-safety contract mapping chunk columns to motors |
| Camera names | client set must cover `policy.config.input_features` image keys | hard reject |
| Resolution | any H×W accepted (server resizes canonically) | warn if aspect ratio differs from training |
| State dim | flattened dim must match | hard reject |
| `schema_version` | client within server's supported range | hard reject |
| fps | vs. manifest `trained_fps` | warn (reject only when `strict_fps: true`) |
| Task | when `pin_task: true`, must equal `default_task` | reject |
| RTC | client RTC requires policy RTC kwargs support | downgrade to append mode + warning |
| Capacity | `active_sessions < max_sessions` | reject with current load → client retries another replica |
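A subset of these checks can be expressed as one pure function. The sketch below covers action order, camera coverage, state dimension, schema version, fps, and capacity; the task, RTC, and resolution checks are left out for brevity. The dataclasses `SessionRequest` and `ServerCapabilities` and the function `validate_session` are hypothetical names, not the server's actual types.

```python
from dataclasses import dataclass


@dataclass
class SessionRequest:
    action_names: list
    camera_names: set
    state_dim: int
    schema_version: int
    fps: float


@dataclass
class ServerCapabilities:
    action_feature_names: list
    image_keys: set
    state_dim: int
    min_schema: int
    max_schema: int
    trained_fps: float
    strict_fps: bool = False
    max_sessions: int = 5
    active_sessions: int = 0


def validate_session(req: SessionRequest, caps: ServerCapabilities) -> tuple:
    """Return (errors, warnings); any error means the session open is rejected."""
    errors, warnings = [], []
    # List equality compares names AND order, which is the sync-safety contract.
    if req.action_names != caps.action_feature_names:
        errors.append("action names/order mismatch")
    missing = caps.image_keys - req.camera_names
    if missing:
        errors.append(f"missing cameras: {sorted(missing)}")
    if req.state_dim != caps.state_dim:
        errors.append(f"state dim {req.state_dim} != {caps.state_dim}")
    if not caps.min_schema <= req.schema_version <= caps.max_schema:
        errors.append(f"schema_version {req.schema_version} unsupported")
    if req.fps != caps.trained_fps:
        target = errors if caps.strict_fps else warnings
        target.append(f"fps {req.fps} differs from trained_fps {caps.trained_fps}")
    if caps.active_sessions >= caps.max_sessions:
        errors.append(f"at capacity ({caps.active_sessions}/{caps.max_sessions})")
    return errors, warnings
```

Returning every error at once, rather than stopping at the first, lets an operator fix a misconfigured robot in a single pass.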

Reply: session_id, model info (repo, revision — consider a checkpoint hash, §15), action_feature_names, chunk_size, trained_fps, supports_rtc, serving_mode, warmed_up, schema_version, warnings. rename_map is applied client-side so the wire format is canonical policy-feature keys across heterogeneous robots (also a prerequisite for future batching).

8.5 Scheduler seam (micro-batching later, not in the first release)

The worker calls a Scheduler.select(ready: list[Session]) -> list[Session]; the first release ships RoundRobin (return ready[:1]). Cross-session batching is blocked on the policy API (inference_delay is scalar; batched clients have different delays/prefixes). When that API limitation is lifted, a MicroBatch scheduler can group same-shape sessions. The seam costs nothing now and prevents a redesign later.
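The seam itself is small. A minimal sketch, assuming the `ready` list arrives already ordered by the worker's round-robin ring; the `Session` dataclass here is a stand-in with a single field, not the server's real session type.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Session:
    client_uuid: str  # stand-in; the real session carries pipelines, stats, etc.


class Scheduler(Protocol):
    def select(self, ready: list) -> list:
        """Pick which ready sessions run in the next inference call."""
        ...


class RoundRobin:
    """One session per inference call, taken from the head of the ready list."""

    def select(self, ready: list) -> list:
        return ready[:1]
```

A later batching scheduler would satisfy the same `select` signature and return more than one session, so the worker loop would not need to change.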

8.6 Manifest

model:
  repo_or_path: lerobot/pi0_towels
  revision: main
  dtype: bfloat16
  device: cuda
default_task: "fold the towel"
pin_task: false
serving_mode: shared # forced to exclusive for chunk-stateful policies
max_sessions: 5 # from the P10 formula (§4): Pi0 @ 150 ms, 1 Hz refresh
warmup_inferences: 2
strict_fps: false
zenoh:
  connect_endpoints: ["tls/router.gpu-cluster.internal:7447"]
  tls:
    connect_certificate: ...
    connect_private_key: ...
    root_ca_certificate: ...
health_port: 9100 # HTTP health + Prometheus metrics
debug: { capture_dir: null, capture_max: 256 }

Draccus dataclass in policy_server/manifest.py; YAML via --manifest, individual overrides via CLI.

9. The Edge Client: `RemoteInferenceEngine`

New file src/lerobot/rollout/inference/remote.py, registered @InferenceEngineConfig.register_subclass("remote").

9.1 Threading model

| Thread | Role |
| --- | --- |
| Main (strategy loop) | `notify_observation(obs)` → lock-protected latest-only slot (identical to `rtc.py` `_obs_holder`). `get_action()` → `ActionQueue.get()` + staleness check. Never any I/O. Structurally fixes legacy BUG-1 (blocking send inside the 33 ms loop). |
| Network worker (1 daemon thread) | Cycle: wait until `queue_remaining·dt ≤ buffer_time_s` and active → snapshot `idx_before`, prefixes, `delay_steps = ceil(L_max/dt)` → encode (JPEG q=`jpeg_quality`) → `publisher.put(obs, attachment=header)` → await chunk on the action subscriber channel (timeout `request_timeout_s`) → `merge(original, processed, ceil(L/dt), idx_before)` → `latency_tracker.add(L)`. Owns the state machine, reconnects, and control queries. One-in-flight (P5). |
| Zenoh action subscriber | `FifoChannel(2)` handler drained by the worker (no Python callback thread on the hot path); liveliness subscriber callback is deposit-only (sets an event). |

Reused unchanged: ActionQueue (policies/rtc/action_queue.py), LatencyTracker, ActionInterpolator (lives in strategies — interpolation_multiplier works with remote for free). Deleted concepts: aggregation zoo, observations_similar, must_go, TimedObservation/TimedAction pickles.

9.2 Fail-safe state machine

              ok                              no chunk for degraded_after_s
 CONNECTING ─────► STREAMING ───────────────────────────────► DEGRADED
   │ ▲               ▲   │ queue empty OR max_action_age_s hit     │
   │ │ backoff,      │   └───────────────────────────────────► STALLED ◄──┘
   │ │ re-handshake  │ first successful merge                      │
   │ └─ RECONNECTING ◄── timeout streak / server liveliness drop ◄─┘
   │        │ offline > max_offline_s, capability/schema mismatch, auth failure
   └──────► DEAD  (failed=True → shutdown_event → strategy teardown: return-to-initial-pose)

DEGRADED: requests failing but the queue still holds actions — the robot keeps executing; chunks are the fault-tolerance buffer (1–3 s of coverage makes blips and clean server drains invisible).
STALLED: queue empty or staleness bound hit → apply fallback: hold (get_action → None; send_next_action already tolerates it), repeat_last, or zero (required for velocity-controlled robots, where "send nothing" means "keep last velocity").
Staleness bound (sync safety): every merge records (chunk_start_index, t_send); get_action refuses any action whose source observation is older than max_action_age_s (default 3.0 s ≈ 90 steps @ 30 fps). Bounds open-loop execution after a network stall.
DEAD: only after max_offline_s (default 60 s) or a hard contract violation (capability/schema mismatch on reconnect — e.g. the server restarted with a different model; never execute wrong-model chunks). Uses the exact mechanism RTC uses (failed=True + global shutdown_event) so existing teardown runs unchanged.
Watchdog layering: per-request timeout (hung server — the BUG-3 fix) → server liveliness token (dead server/router) → staleness bound (the robot-side invariant that holds regardless of why data stopped).
Pause/resume (DAgger): pause() stops the worker publishing (slot keeps refreshing, ignored); queue intact — parity with RTCInferenceEngine.pause. DAgger's existing interpolator.reset(); engine.reset(); engine.resume() sequence works unchanged.
reset() (episode boundary): clear ActionQueue + staleness bookkeeping, bump episode_id, fire the acked reset query (1 s timeout, failure logged — the server has nothing it must do thanks to per-request statelessness), flag episode_start on the next observation. LatencyTracker intentionally survives reset (latency is episode-invariant; parity with local RTC).
ready = session opened ∧ capabilities validated ∧ server warmed_up. First-chunk gating is implicit (get_action → None until the first merge).
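The staleness bound and the fallback ladder combine into a single decision per control tick. The sketch below is an assumed illustration: the enum mirrors the `hold | repeat_last | zero` options named above, while the function `select_action` and its parameters are hypothetical and actions are modeled as plain lists of floats.

```python
from enum import Enum
from typing import Optional


class FallbackBehavior(Enum):  # stand-in for the config enum
    HOLD = "hold"
    REPEAT_LAST = "repeat_last"
    ZERO = "zero"


def select_action(
    queued: Optional[list],
    source_obs_age_s: float,
    max_action_age_s: float,
    fallback: FallbackBehavior,
    last_action: Optional[list],
    action_dim: int,
) -> Optional[list]:
    """Return the action to execute this tick, or None to send nothing."""
    fresh = queued is not None and source_obs_age_s <= max_action_age_s
    if fresh:
        return queued
    # STALLED: the queue is empty or the staleness bound was hit.
    if fallback is FallbackBehavior.HOLD:
        return None
    if fallback is FallbackBehavior.REPEAT_LAST:
        return last_action  # None if nothing has been executed yet
    return [0.0] * action_dim  # ZERO: explicit zero command
```

Note that a queued action older than `max_action_age_s` is treated exactly like an empty queue, which is what bounds open-loop execution after a network stall.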

9.3 Weightless client — exact integration changes

rollout/context.py: PolicyContext.{policy, preprocessor, postprocessor} become | None. For remote configs, skip step 1 (weight load / PEFT / .to(device) / torch.compile / init_rtc_processor) and step 6 (make_pre_post_processors). Verified safe: strategies only consume ctx.policy.inference. Keep steps 2–5 (robot processors, hardware, features, dataset) — they are robot-derived. Keep the visual pre-flight check (context.py:309-324): --policy.path already loads config-only (rollout/configs.py:324-328, no weight download) and failing before dialing the server is free. use_torch_compile / explicit --device → warn-and-ignore for remote.
rollout/inference/factory.py: signature loosens to policy: PreTrainedPolicy | None (+ policy_config: PreTrainedConfig); sync/rtc branches guard policy is None; the remote branch lazy-imports (eclipse-zenoh stays an optional extra).
The authoritative validation moves to session open (§8.4); the local check becomes a fast-fail convenience.
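
A minimal sketch of the loosened factory contract described above, assuming a simplified signature. `make_inference_engine` and its tuple return values are placeholders for illustration; the real factory returns engine objects and takes the rollout config. The points it shows are the `policy is None` guard on the local branches and the import that only happens on the remote branch (the `eclipse-zenoh` distribution provides the `zenoh` module).

```python
import importlib


def make_inference_engine(engine_type, policy=None, policy_config=None):
    """Hypothetical factory: `policy` may be None only for the remote engine."""
    if engine_type in ("sync", "rtc"):
        if policy is None:
            raise ValueError(f"inference type {engine_type!r} needs locally loaded policy weights")
        return ("local", engine_type, policy)
    if engine_type == "remote":
        if policy_config is None:
            raise ValueError("remote inference still needs the config-only policy config")
        try:
            # Imported here, not at module top, so the dependency stays optional.
            zenoh = importlib.import_module("zenoh")
        except ImportError as exc:
            raise ImportError("remote inference requires the optional eclipse-zenoh package") from exc
        return ("remote", zenoh, policy_config)
    raise ValueError(f"unknown inference type: {engine_type!r}")
```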

9.4 Config

from dataclasses import dataclass, field

@InferenceEngineConfig.register_subclass("remote")
@dataclass
class RemoteInferenceConfig(InferenceEngineConfig):
    connect_endpoint: str = "tls/localhost:7447"   # zenoh router endpoint
    tls_cert: str | None = None
    tls_key: str | None = None
    tls_ca: str | None = None
    client_uuid: str = ""                # "" → uuid4 at start()
    jpeg_quality: int = 90               # 0 = raw (LAN/debug)
    buffer_time_s: float = 0.5           # send next obs when queue playback ≤ this (v1 G14); KEPT
    max_action_age_s: float = 3.0        # staleness bound (safety)
    degraded_after_s: float = 1.0
    request_timeout_s: float = 5.0
    reconnect_initial_backoff_s: float = 0.5
    reconnect_max_backoff_s: float = 10.0
    max_offline_s: float = 60.0
    fallback: FallbackBehavior = FallbackBehavior.HOLD   # hold | repeat_last | zero
    rtc: RTCConfig = field(default_factory=RTCConfig)    # enabled → replace mode; horizon caps prefix
    tags: dict[str, str] = field(default_factory=dict)   # e.g. cluster/experiment labels

# Remote RTC + sentry recording (the reproducibility path)
# --policy.path is config-only: no weights are downloaded
lerobot-rollout \
    --strategy.type=sentry \
    --policy.path=lerobot/pi0_towels \
    --inference.type=remote \
    --inference.connect_endpoint=tls/router.gpu-cluster.internal:7447 \
    --inference.rtc.execution_horizon=10 \
    --robot.type=so100_follower --robot.port=/dev/ttyACM0 \
    --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
    --dataset.repo_id=user/rollout_fleet_a --dataset.single_task="fold the towel"
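
The config's `reconnect_initial_backoff_s` and `reconnect_max_backoff_s` bound a reconnect schedule whose shape the spec leaves open. One plausible reading is capped exponential backoff, sketched below. The doubling `factor` and the optional `jitter` are assumptions for illustration, and `backoff_delays` is a hypothetical helper.

```python
import random
from itertools import islice


def backoff_delays(initial_s=0.5, max_s=10.0, factor=2.0, jitter=0.0, rng=random.random):
    """Yield reconnect delays: exponential growth capped at max_s."""
    delay = initial_s
    while True:
        # Optional jitter spreads a fleet's reconnect attempts after a router restart.
        yield min(delay, max_s) * (1.0 + jitter * rng())
        delay = min(delay * factor, max_s)


# With the config defaults and no jitter:
print(list(islice(backoff_delays(), 7)))  # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Under these assumptions a client reaches the 10 s cap after five attempts, so a 60 s `max_offline_s` window allows roughly nine attempts before DEAD.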

10. Wire Schema

10.1 Payload anatomy & rates — KEPT (JPEG) with numbers

Upstream per request: joints (24–128 B) + JPEG frames (480p q90 ≈ 40–90 KB each; 720p ≈ 110–230 KB) + RTC prefixes (≤ a few KB) → 60–450 KB depending on cameras. Downstream: 2 × chunk_size × action_dim × 4 B + metadata → 3–50 KB. The effective request rate is self-clocked by buffer_time_s to ~1–4 Hz per robot (not the 30 Hz control rate), so in a 300-robot fleet each robot contributes ≈ 0.3–10 Mbps. The wire is never the bottleneck; bandwidth budgeting is about camera count and resolution, and each GPU pod only ever sees its own ≤ max_sessions clients. Zenoh fragments >64 KiB payloads transparently; multi-MB messages are fine.
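
The payload arithmetic above can be checked with two one-line helpers. This is a worked example, not a measurement: the payload sizes and rates are picked from the ranges quoted in the text, the chunk sizes and action dimensions are illustrative, and reading the factor of 2 as "two f32 chunks per response" is an inference from the `chunk_model` / `chunk_robot` fields of the wire schema.

```python
def upstream_mbps(payload_kb, request_hz):
    """Per-robot uplink in Mbps for a request payload and rate (1 KB = 1000 B)."""
    return payload_kb * 1000 * 8 * request_hz / 1e6


def chunk_bytes(chunk_size, action_dim):
    """Downstream tensor bytes: two f32 chunks, metadata excluded."""
    return 2 * chunk_size * action_dim * 4


# One 480p camera at 40 KB, 1 request/s
low = upstream_mbps(40, 1)      # 0.32 Mbps
# Three 480p cameras at 90 KB each, 4 requests/s
high = upstream_mbps(270, 4)    # 8.64 Mbps
# 50-step chunk of 32-dim actions
down = chunk_bytes(50, 32)      # 12800 B
```

Both uplink figures sit inside the ≈ 0.3–10 Mbps band quoted above, and the downstream chunk stays within the 3–50 KB range.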

10.2 Attachment header (fixed-layout, packed little-endian — parsed without touching the body)

| Field | Type | Notes |
|---|---|---|
| `schema_version` | u16 | negotiated at session open |
| `msg_type` | u8 | OBS / CHUNK / EVENT |
| `seq_id` | u64 | per-session monotonic; echoed in the chunk |
| `episode_id` | u32 | bumped by `reset()` |
| `client_mono_ns` | i64 | client `monotonic_ns()`; opaque to the server, echoed back |
| `session_epoch` | u32 | bumped per (re)connect; stale-epoch chunks dropped |
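
A fixed-layout, packed little-endian header of this shape maps directly onto Python's `struct` module. The sketch below assumes the fields are laid out in table order with no padding, which gives 27 bytes. The numeric `msg_type` values and the helper names are assumptions, not part of the spec.

```python
import struct
from typing import NamedTuple

# '<' = little-endian, no alignment padding: u16, u8, u64, u32, i64, u32
HEADER = struct.Struct("<HBQIqI")

MSG_OBS, MSG_CHUNK, MSG_EVENT = 0, 1, 2  # numeric values are assumptions


class Header(NamedTuple):
    schema_version: int
    msg_type: int
    seq_id: int
    episode_id: int
    client_mono_ns: int
    session_epoch: int


def pack_header(header):
    """Serialize a Header to its fixed-size byte layout."""
    return HEADER.pack(*header)


def unpack_header(buf):
    """Parse only the leading header bytes; the body is never touched."""
    return Header(*HEADER.unpack(buf[:HEADER.size]))


def is_stale(header, current_epoch):
    """Chunks from a previous (re)connect epoch are dropped."""
    return header.session_epoch != current_epoch
```

Because the layout is fixed, a router or scheduler can read `seq_id` and `session_epoch` and discard a message without decoding the msgpack body.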

10.3 msgpack bodies

ObservationMsg (client → server): state: {names_ref, data: f32 LE bytes}, images: {name: {codec: jpeg|raw, bytes, (h,w,c) if raw}}, task: str, inference_delay_steps: int, prefix_model: tensor?, prefix_robot: tensor? (tensors = raw LE bytes + dtype + shape), episode_start: bool.

ActionChunkMsg (server → client): seq_id_echo, client_mono_ns_echo, chunk_model: tensor, chunk_robot: tensor, queue_wait_ms: f32, inference_ms: f32, superseded_seqs: u32, server_load: f32.

Status / SessionOpen / SessionAck / ResetMsg: as specified in §8.4.
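
The tensor convention (raw little-endian bytes plus dtype plus shape) can be illustrated with the standard library alone. The dictionary keys `dtype`, `shape`, and `data` and the two supported dtypes are assumptions for this sketch; the resulting dict is the kind of value a msgpack encoder would then serialize, and real code would use NumPy or torch buffers rather than Python lists.

```python
import struct
from math import prod

_FORMATS = {"float32": "f", "int64": "q"}  # dtype name -> struct code (subset)


def encode_tensor(values, shape, dtype="float32"):
    """Flatten-order values -> {dtype, shape, raw little-endian bytes}."""
    if prod(shape) != len(values):
        raise ValueError("shape does not match element count")
    data = struct.pack(f"<{len(values)}{_FORMATS[dtype]}", *values)
    return {"dtype": dtype, "shape": list(shape), "data": data}


def decode_tensor(msg):
    """Inverse of encode_tensor; returns the flat list of values."""
    count = prod(msg["shape"])
    return list(struct.unpack(f"<{count}{_FORMATS[msg['dtype']]}", msg["data"]))
```

A round trip is bit-exact for values already representable in the target dtype, which is the property the golden codec tests in §10.4 are meant to pin down.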

10.4 Schema discipline (P7)

schema_version gates at handshake; evolution is additive-only (new optional msgpack keys; unknown keys ignored); attachment layout changes require a version bump; golden codec round-trip tests (tensor exactness, JPEG RGB-channel-order regression — a silent BGR swap poisons every VLA in the fleet) are part of the test suite. No pickle anywhere — KEPT from v1 and now structural: nothing in the schema can carry code.

11. Latency Budget & the Clock Iron Rule

| Stage | LAN | WAN (50 ms RTT) |
|---|---|---|
| JPEG encode ×3 (edge CPU) | 2–9 ms | 2–9 ms |
| Serialize | <1 ms | <1 ms |
| Uplink (tx + ½RTT) | ~2 ms | ~54 ms |
| Server queue wait | 0 → 1×inference | 0 → 1×inference |
| Decode + canonical preprocess | 4–10 ms | 4–10 ms |
| Inference | 15–150 ms | 15–150 ms |
| Postprocess + downlink + merge | ~2 ms | ~27 ms |
| Total (Pi0-class) | ~110–175 ms | ~190–250 ms |

Inference is 60–85 % of end-to-end on LAN; the entire transport+serialization stack is <10 ms. WAN adds propagation + uplink bandwidth — identical under any transport. At 30 fps this lands delay_steps ≈ 4–8, comfortably inside RTC execution horizons: WAN degrades smoothness parameters, never correctness. This table is the standing answer to transport-performance bikeshedding.

Clock iron rule (P4): wall-clock instants never cross machines. Client stamps monotonic_ns, the server echoes it opaquely; RTT = now − echo. The server reports only durations (queue_wait_ms, inference_ms) measured on its own monotonic clock; network_time = RTT − queue_wait − inference for diagnostics. The schema has no field in which a foreign wall-clock instant can be compared — the legacy time.time() bug is unrepresentable.
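
The clock rule reduces to a few subtractions on the client. In the sketch below, `latency_breakdown` and `delay_steps` are hypothetical helpers and the small epsilon in the ceiling is an added guard against floating-point noise; the formulas themselves (RTT = now − echo, network_time = RTT − queue_wait − inference, delay in steps = ceil(latency / dt)) come from the text.

```python
import math
import time


def latency_breakdown(client_mono_ns_echo, queue_wait_ms, inference_ms, now_ns=None):
    """Split a round trip using only the client's monotonic clock and server durations."""
    now_ns = time.monotonic_ns() if now_ns is None else now_ns
    rtt_ms = (now_ns - client_mono_ns_echo) / 1e6
    return {"rtt_ms": rtt_ms, "network_ms": rtt_ms - queue_wait_ms - inference_ms}


def delay_steps(latency_ms, fps):
    """Control steps elapsed during one round trip, rounded up."""
    return math.ceil(latency_ms * fps / 1000 - 1e-9)


# 110 ms and 250 ms end-to-end at 30 fps bracket the 4-8 step range quoted above.
assert delay_steps(110, 30) == 4
assert delay_steps(250, 30) == 8
```

No wall-clock value appears anywhere: the only timestamp is the client's own `monotonic_ns()` reading, returned untouched by the server.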

12. Reproducibility & Audit (P8)

The contract is fully logged + replayable, not "deterministic":

Client = source of truth. Recording strategies already persist observations + executed actions to LeRobotDataset. The remote engine logs, per executed action, the (session_id, seq_id, episode_id) of its source chunk plus the echoed queue_wait_ms/inference_ms (dataset-extras columns are a follow-up; client logs in v1).
Server audit line per request (structured JSON): {ts, session_id, client_uuid, seq_id, episode_id, queue_wait_ms, inference_ms, chunk_range, superseded_seqs, outcome}.
Optional bounded capture: debug.capture_dir writes a ring of request/response pairs (safetensors) for byte-exact offline replay through the same server pipeline.
Runbook — "robot #217 stuttered at 14:03": (1) Grafana session_staleness{client="217"} — spike ⇒ server side, flat ⇒ client/network. (2) Server side: audit lines — queue_wait_ms rising across all sessions ⇒ overloaded replica (check active_sessions vs max_sessions); superseded_seqs streak on 217 only ⇒ that client over-requesting; outcome=error ⇒ adjacent stack trace. (3) Client side: state-machine transitions + reconnects in the client log; dataset rows show which seq's chunk was executing and where None ticks occurred. Every hop shares (session_id, seq_id) — the join is mechanical.
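
The mechanical join the runbook relies on can be sketched in a few lines. The field list is taken from the audit line above; `audit_line`, `join_on_request`, the client row shape, and the list form of `chunk_range` are assumptions for illustration.

```python
import json

AUDIT_FIELDS = ("ts", "session_id", "client_uuid", "seq_id", "episode_id",
                "queue_wait_ms", "inference_ms", "chunk_range",
                "superseded_seqs", "outcome")


def audit_line(**fields):
    """Serialize one structured audit record; all listed fields are required."""
    missing = [k for k in AUDIT_FIELDS if k not in fields]
    if missing:
        raise ValueError(f"missing audit fields: {missing}")
    return json.dumps({k: fields[k] for k in AUDIT_FIELDS})


def join_on_request(client_rows, server_lines):
    """Pair each client row with its server record via (session_id, seq_id)."""
    index = {}
    for line in server_lines:
        rec = json.loads(line)
        index[(rec["session_id"], rec["seq_id"])] = rec
    return [(row, index.get((row["session_id"], row["seq_id"]))) for row in client_rows]
```

A client row with no matching server record (paired with `None` here) points at a request that never reached the server or whose response was lost, which is the client/network branch of the runbook.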

13. Integration & Migration Plan

13.1 New

| Path | Content |
|---|---|
| `src/lerobot/policy_server/{__init__,schema,codec,manifest,session,scheduler,validation,server}.py` | wire schema constants, msgpack/attachment codecs, manifest dataclasses, `Session` + mailbox, `Scheduler` seam, capability rules + chunk-stateless registry, zenoh servicer + inference worker + drain + HTTP health/metrics |
| `src/lerobot/rollout/inference/remote.py` | `RemoteInferenceEngine` (~600 lines; mirrors `rtc.py` structure) |
| `src/lerobot/scripts/lerobot_policy_server.py` + `[project.scripts]` entry | thin `main()` |
| `docker/Dockerfile.policy-server` | CUDA runtime base + uv; manifest via ConfigMap |
| `docs/source/remote_inference.mdx` (+ `_toctree.yml`) | replaces `async.mdx` |

13.2 Modified

rollout/inference/factory.py (config + Optional-typed signature + lazy import) · rollout/context.py (weightless branch) · rollout/inference/__init__.py · scripts/lerobot_rollout.py docstring · pyproject.toml: [async] extra becomes eclipse-zenoh>=1.9,<2.0 + msgpack (grpcio/matplotlib leave it; grpcio remains under [hilserl]/dev for the RL stack).

13.3 Removed — same landing PR

src/lerobot/async_inference/ · tests/async_inference/ · docs/source/async.mdx + its _toctree.yml entry · the AsyncInference service + Observation/Actions/PolicySetup messages from src/lerobot/transport/services.proto (regenerate pb2; LearnerService untouched — transport/ is shared with HIL-SERL (src/lerobot/rl/); the RL test suite gates this change).

13.4 Legacy config → successor mapping

| Legacy (`RobotClientConfig`/`PolicyServerConfig`) | Successor |
|---|---|
| `server_address` | `--inference.connect_endpoint` (zenoh router) |
| `policy_type`, `pretrained_name_or_path` | `--policy.path` (config-only) + server manifest |
| `chunk_size_threshold` (0–1 ratio) | `--inference.buffer_time_s` (seconds) |
| `actions_per_chunk` | server manifest (validated at session open) |
| `aggregate_fn_name` + `AGGREGATE_FUNCTIONS` | dropped: `ActionQueue` replace/append |
| `policy_device`, `client_device` | dropped: server concern / chunks arrive CPU f32 |
| `debug_visualize_queue_size` | dropped: Rerun (`--display_data`) + engine stats |
| `PolicyServerConfig.{host,port}` | manifest `zenoh.connect_endpoints` |
| `inference_latency`, `obs_queue_timeout` | dropped: latency client-measured; no server obs queue |
| `SendPolicyInstructions` | dropped: MaaS manifest + session validation |
| `observations_similar` / `must_go` | dropped: latest-only slots + client send gate |
| pickle envelopes | dropped: msgpack + attachment headers |

13.5 Legacy bugs/gaps → structural resolution

BUG-1 → worker thread owns all I/O. BUG-2 → aggregation deleted; ActionQueue is internally locked. BUG-3 → per-request timeout + liveliness. BUG-4 → client-side send gating; server newest-wins. G1 → per-session registry. G2 → manifest. G4 → msgpack+attachments. G5 → monotonic echo + delay_steps. G7 → recording strategies. G8 → mTLS + ACL. G9 → server-side canonical processors. G11 → status queryable. G12 → Prometheus + audit logs. G13 → lerobot-policy-server console script. G14 → buffer_time_s.

13.6 Tests

Unit: codec round-trips (tensor exact; JPEG RGB-order regression), capability-validation matrix (§8.4 as parametrized cases), scheduler fairness + newest-wins supersession (mock policy with configurable sleep), manifest parsing, key-expr sanitization.
Loopback integration (CPU, fast CI): client+server in one process over zenoh peer-to-peer (or a localhost zenohd started by the fixture), tiny-ACT, fake 2-camera robot, N=8 concurrent sessions. The headline regression: two sessions with different joint states must not cross-contaminate RelativeActionsProcessorStep postprocessing — the test that proves the multi-tenancy claim.
Chaos: kill the server mid-episode → client returns None, never raises into the control loop, failed stays False within max_offline_s, resumes on restart; docker kill zenohd → liveliness flap → safe state → re-handshake (explicitly tests re-declaration behavior, flagged unverified upstream); SIGTERM drain → in-flight chunk completes, clients reconnect invisibly.
Golden parity: remote RTC vs local RTCInferenceEngine on identical observation sequences → byte-identical merged queues (the re-anchoring contract test). Gate for any real-robot remote-RTC use.
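
The newest-wins supersession unit test needs a mailbox whose semantics are easy to pin down. The sketch below is one way to express a latest-only slot with a supersession counter. `LatestSlot` is a hypothetical stand-in; the actual `Session` mailbox and scheduler interfaces may differ.

```python
import threading


class LatestSlot:
    """Single-item mailbox: a newer observation replaces an unread older one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._item = None
        self._superseded = 0

    def put(self, seq_id, obs):
        with self._lock:
            if self._item is not None:
                self._superseded += 1  # the unread item is dropped, never queued
            self._item = (seq_id, obs)

    def take(self):
        """Return (item, superseded_count) and clear the slot."""
        with self._lock:
            item, count = self._item, self._superseded
            self._item, self._superseded = None, 0
            return item, count


def test_newest_wins():
    slot = LatestSlot()
    for seq, obs in [(1, "a"), (2, "b"), (3, "c")]:
        slot.put(seq, obs)
    assert slot.take() == ((3, "c"), 2)   # only the newest survives
    assert slot.take() == (None, 0)       # slot is empty afterwards
```

The counter returned by `take()` corresponds to the `superseded_seqs` value reported in the chunk and the audit line.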

14. Roadmap

PR1 — schema & codecs (no torch deps): policy_server/{schema,codec,manifest}.py, key-expr sanitizer, golden codec tests.
PR2 — server core: session registry, scheduler, validation/allowlist, inference worker with mock policy, loopback harness.
PR3 — client engine: RemoteInferenceEngine, factory/context weightless integration, loopback integration + chaos + golden-parity tests.
PR4 — ops & docs: Dockerfile, health/metrics, drain, ACL examples, remote_inference.mdx, rollout docstring.
Landing PR — legacy deletion: remove async_inference/ + tests + docs + proto service (RL suite gates), [async] extra swap.
Pre-release field validation: one real robot on a lossy network (watchdog default tuning); JPEG q90 vs raw A/B on one policy (train/serve shift).
Future: micro-batching (needs per-sample inference_delay across policy families), client-side downscale-to-policy-resolution (config-only shapes make it possible), Advanced Pub/Sub on the action topic, per-robot quotas, dataset provenance columns, supports_stateless_chunking attribute upstreamed to policy classes.

15. Open Risks

| Risk | Mitigation / decision needed |
|---|---|
| Re-anchoring parity (server-side relative-prefix re-anchor vs `rtc.py`) | Golden parity test (§13.6) is a hard gate before robot use; likely failure mode is normalizer dtype/device drift |
| First-chunk over-trim when idle: `merge` trims `ceil(L/dt)` even when nothing was consumed (queue empty at episode start), which is wasteful at network latencies (600 ms ⇒ 18 steps) | Proposed clamp `real_delay = min(real_delay, last_index - idx_before)` touches the shared `ActionQueue` used by local RTC; needs sign-off + regression tests |
| JPEG train/serve distribution shift | Unmeasured; A/B before locking q90 default (roadmap §14, pre-release field validation) |
| Watchdog defaults untuned (`request_timeout_s=5`, `degraded_after_s=1`, `max_action_age_s=3`) | Field validation on wired and Wi-Fi; consider named profiles |
| Capability check can pass while semantics differ (different finetune, different normalization stats, identical feature names) | Add checkpoint hash/revision pinning to SessionAck; decide in PR2 |
| zenoh-python long-session maturity: re-declaration after router restart partially verified; SHM unstable; no asyncio | Chaos tests own this; thread-based design avoids the asyncio gap entirely |
| Router ACL reload requires restart | Operational runbook: cert/ACL changes = rolling router restart |
| `fallback=zero` has no consumer until velocity actions land in rollout (only `.pos` features routed today) | Validate the enum against robot capabilities when velocity support lands |
| Per-client mailbox memory under fleet-scale wildcard subscription | One decoded-obs slot per client is small; add an LRU GC tied to liveliness drops |
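
The first-chunk over-trim risk and its proposed clamp can be made concrete with a small calculation. The sketch assumes `last_index - idx_before` is the number of actions actually consumed between sending the observation and merging the chunk; that reading, the helper name `trim_count`, and the epsilon guard are assumptions, and the real `ActionQueue.merge` is not reproduced here.

```python
import math


def trim_count(latency_s, fps, idx_before, last_index, clamp=True):
    """Leading actions to drop from a freshly arrived chunk (hypothetical helper)."""
    real_delay = math.ceil(latency_s * fps - 1e-9)
    if clamp:
        # Never skip more steps than were actually consumed while waiting.
        real_delay = min(real_delay, last_index - idx_before)
    return real_delay


# Idle queue at episode start: nothing consumed during a 600 ms round trip.
assert trim_count(0.6, 30, idx_before=0, last_index=0, clamp=False) == 18
assert trim_count(0.6, 30, idx_before=0, last_index=0, clamp=True) == 0
# Steady state: 18 steps consumed, so the clamp changes nothing.
assert trim_count(0.6, 30, idx_before=10, last_index=28) == 18
```

Under this reading the clamp only differs from current behavior when the robot was idle, which is the case regression tests for local RTC would need to cover.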
