59 KiB
Decoupled VLA Inference & Edge Control v2: Async Network Inference for lerobot-rollout
Status: supersedes the v1 proposal in full. v1 was written against the standalone
src/lerobot/async_inference/prototype, beforelerobot-rolloutexisted. This revision re-grounds the design in the current codebase, keeps v1's decisions that survived contact with it (marked KEPT throughout), reverses the ones that didn't, and adds the safety, multi-tenancy, and operations specifications v1 lacked.
1. Executive Summary
This document specifies a production-grade system for decoupling GPU-bound policy inference from high-frequency robot control, targeting power users running hundreds of robots against centralized GPU clusters. The system keeps v1's Model-as-a-Service (MaaS) paradigm and Zenoh transport, but changes the integration architecture fundamentally:
- The client is not a standalone CLI. It is
--inference.type=remote, a newInferenceEnginebackend insidelerobot-rollout(src/lerobot/rollout/inference/). Every rollout strategy (base, sentry, highlight, dagger, episodic) gets network inference for free — including dataset recording, DAgger pause/resume, Rerun visualization, and safe teardown. - The client is weightless. No policy weights, no policy processors on the edge.
--policy.pathresolves to a config-onlyPreTrainedConfig(no weight download) used for pre-flight validation and action ordering. - The server is stateless per request. All RTC chunk state (leftover prefixes, latency tracking, delay computation) lives client-side in the existing
ActionQueue/LatencyTrackermachinery — the client ships prefixes + a delay hint with each observation. A server crash loses zero control state; reconnects and horizontal scaling are trivial. - Multi-tenancy is engineered, not assumed. The real hazards are stateful processor pipelines and episode-scoped policy state — not
predict_action_chunkpurity (which holds for ACT/Pi0/Pi0.5/SmolVLA but not diffusion). The server uses per-session processor instances, a chunk-stateless allowlist, and an exclusive serving mode for policies that need it. - The legacy module dies.
src/lerobot/async_inference/(~1,900 lines, pickle-over-gRPC, single-client, four confirmed bugs) is deleted in the same PR that lands the new backend. No deprecation cycle: the module is experimental, its CLI undocumented in the main flow, and every config field has a mapped successor (§13.4).
2. Motivation (unchanged from v1) — KEPT
LeRobot's standard control loop runs policy inference and robot I/O in the same process. This breaks down when:
- The policy is too large for edge hardware (Pi0-class models need a dedicated GPU).
- Multiple robots need the same policy (redundant GPU allocation per robot).
- Inference latency exceeds the control deadline (e.g. 150 ms inference on a 33 ms control tick).
Decoupling solves all three: the edge runs a tight CPU loop; a GPU server performs inference for N clients.
What changed since v1: the local version of this decoupling already shipped. RTCInferenceEngine (src/lerobot/rollout/inference/rtc.py) runs inference in a background thread against a thread-safe ActionQueue with latency-aware chunk merging. The network system is that same architecture with the thread boundary replaced by a network boundary. This is the design's central simplification: reuse, don't reinvent.
3. Gap Analysis: v1 Proposal vs. Modern Codebase
| Topic | v1 assumed | Modern reality | Verdict |
|---|---|---|---|
| Client architecture | Standalone robot-client CLI (§5.1 of v1) | InferenceEngine ABC seam in lerobot-rollout (rollout/inference/base.py); strategies are backend-agnostic |
Superseded — backend, not CLI |
| Chunk blending | Configurable aggregation zoo (weighted_average, …) |
ActionQueue replace-with-delay-trim (RTC) / append (non-RTC) (policies/rtc/action_queue.py:147-217) |
Superseded — drop blending entirely |
| Latency compensation | Hand-rolled RTT trim (expired_steps = int(rtt/dt), v1 §8.2) |
ActionQueue.merge(..., real_delay, idx_before) + LatencyTracker already do this, validated |
Superseded |
| Multi-tenancy invariant | "predict_action_chunk() pure ⇒ safe to share" |
Processor state + episode-scoped policy state are the real hazards (§7) | Incomplete — fixed in §8.3 |
| Data logging | Client-side build_dataset_frame + add_frame sketch (v1 §14) |
Recording strategies (sentry/episodic/dagger) already log obs + executed actions | Superseded — free via rollout |
| MaaS pre-warm, no dynamic loading | ✓ | Still right; legacy SendPolicyInstructions is a pickle/RCE + capacity-planning disaster |
KEPT |
| JPEG observation compression | ✓ | Still right (§10.1) | KEPT |
| Status/capability validation before start | ✓ (Zenoh queryable) | Still right; extended into a hard sync-safety contract (§8.4) | KEPT, extended |
| Time-based send threshold (v1 G14) | ✓ | Adopted as buffer_time_s |
KEPT |
| Zenoh pub/sub data plane | ✓ | Confirmed; QoS corrected (§6.3), control plane moved to queryables, liveliness added | KEPT, hardened |
| MessagePack serialization | ✓ | Endorsed (zenoh's ext serializer cannot encode numpy); must be version-gated (§10.4) |
KEPT, with schema discipline |
| QoS table (v1 §6.2) | "obs best-effort, actions reliable" | Conflates transport reliability with congestion control; BLOCK on actions is dangerous | Revised (§6.3) |
| Bugs BUG-1…BUG-4, gaps G1…G14 | Listed as work items | Every one resolved structurally by this design (§13.5 mapping) | Resolved by design |
4. Critical Pushbacks on v1
Each pushback: claim → evidence → consequence for this design.
P1 — A standalone client duplicates lerobot-rollout.
v1 §5.1 assigns the client: observation capture, action execution at frequency, fail-safe, data logging. Every one of those is already owned by rollout strategies and send_next_action (rollout/strategies/core.py:269-304), which tolerates None actions, runs the interpolator, and routes through the canonical robot processors. A standalone client re-implements loop timing, recording, DAgger UX, Rerun, and teardown safety — and then drifts. Consequence: the client is RemoteInferenceEngine, registered as --inference.type=remote next to sync and rtc.
P2 — The aggregation-function zoo fabricates actions no policy predicted.
0.3*old + 0.7*new produces hybrid actions that exist in no policy's output distribution; the logged action becomes unexplainable (bad for the reproducibility story) and the implementation hosted a real lock-release race (BUG-2, async_inference/robot_client.py:236-267). RTC's prefix-conditioned chunk generation is the principled mechanism for smooth chunk transitions; plain append covers non-RTC chunking. Consequence: ActionQueue replace/append are the only two merge semantics. The zoo is deleted.
P3 — "predict_action_chunk pure ⇒ multi-tenant safe" is incomplete.
Verified in-tree: (a) RelativeActionsProcessorStep caches _last_state at preprocess (processor/relative_action_processor.py:131) and the postprocessor reads it back (:189) — a shared pipeline across clients is a race; (b) DiffusionPolicy.predict_action_chunk reads self._queues, which only select_action populates (policies/diffusion/modeling_diffusion.py:90-108) — it is not chunk-stateless; (c) SAC/SARM have no predict_action_chunk at all. Consequence: per-session processor instances (mandatory), a chunk-stateless allowlist, serving_mode: exclusive for diffusion-family, refusal at startup for SAC/SARM, and policy.reset() is never called in shared mode (§8.3).
P4 — v1 re-derives latency compensation that already exists, on top of broken clocks.
v1 §8 specifies an in-flight RTT dict and manual stale-step trimming. ActionQueue.merge(original, processed, real_delay, idx_before) already trims real_delay stale steps and cross-validates against actions consumed in flight (action_queue.py:219-246). Worse, the legacy code compares wall clocks across machines (robot_client.py:420 stamps time.time() "to compare timestamps across client and server"; policy_server.py:178 compares it) — NTP skew is the same order as the latencies being measured. Consequence: the monotonic iron rule (§11): instants never cross machines; client timestamps are opaque echoed tokens; servers report only durations. delay_steps = ceil((rtt + inference)/dt) is computed client-side from client-local perf_counter samples and shipped per request.
P5 — One-in-flight per client is a correctness requirement, not a tuning choice.
At send time the client snapshots idx_before = queue.get_action_index() and the leftover prefixes; merge validates against them. Two in-flight requests carry conflicting snapshots — the second merge corrupts both RTC replace mode and append mode. The local RTC thread is also strictly one-inference-at-a-time; one-in-flight preserves exact parity. Consequence: the worker publishes one observation, waits for its chunk (or timeout), then sends the next. v1 §8.1's out-of-order in-flight dict is dead weight; a late chunk is accepted only if it answers the latest outstanding seq_id, otherwise dropped.
P6 — v1's QoS table conflates transport reliability with congestion behavior.
"Reliable delivery for actions" sounds right but the dangerous knob is congestion control: a publisher configured BLOCK on the action topic can stall the server's publish path on one robot's dead uplink (Zenoh blocks up to wait_before_close, then may close the transport). A dropped action chunk is recoverable by design — the client's queue keeps the robot moving and the next chunk replaces it. Consequence (§6.3): actions = reliability=RELIABLE (hop-level) + congestion_control=DROP + express=True + priority=INTERACTIVE_HIGH; observations = DROP + DATA. If WAN loss proves material, upgrade the action topic to Zenoh Advanced Pub/Sub (cache + recovery, zenoh ≥ 1.5) rather than BLOCK.
P7 — Schema-less MessagePack invites silent version drift across a 300-robot fleet.
msgpack stays (zenoh's ext serializer cannot encode numpy/dataclasses, and the team's choice stands), but naked msgpack dicts across heterogeneous fleet versions fail at runtime, on the robot. Consequence (§10.4): a packed little-endian attachment header (schema_version, seq_id, episode_id, client_mono_ns — the rmw_zenoh pattern) so routing/correlation never deserializes the body; schema_version negotiated at the session handshake; additive-only evolution; golden codec tests. Protobuf-over-ZBytes is the documented fallback if drift bites in practice.
P8 — "Deterministic rollout reproducibility" is unattainable on real robots.
No seed controls hardware, sensor noise, or network jitter; RTC's latency-driven trimming is inherently timing-dependent. Consequence: the contract is fully logged + replayable (§12): recording strategies already persist observations and executed actions; the remote engine adds (session_id, seq_id, episode_id) provenance so client datasets join server audit logs mechanically.
P9 — v1 has no safety specification.
"Log a warning when the buffer empties" is not a fail-safe for a 300-robot fleet. Consequence (§9): a staleness bound (max_action_age_s — never execute an action older than X relative to its source observation), an explicit fallback ladder (hold / repeat_last / zero — zero-command required for future velocity-controlled robots), and a DEAD state that triggers the existing strategy shutdown path (return-to-initial-pose, disconnect) via the same shutdown_event mechanism RTC uses (rtc.py:359-360).
P10 — Capacity must be formula-driven, not "a user decision".
v1 §4 says clients-per-server "is a user decision". With t = server time per request, r = per-client request rate, H = RTC execution horizon, dt = control period:
N_max = min( 0.8 / (r·t), (H·dt/2 − RTT_net) / t )
→ ACT @ 20 ms, 1 Hz: ~40 clients/GPU. Pi0 @ 150 ms, 1 Hz: ~5 clients/GPU. 300 robots on Pi0 ≈ 60 GPU pods. Consequence: the manifest carries max_sessions; the server rejects session opens beyond it (with current load in the reply) so clients retry another replica. Micro-batching is deferred — blocked on a real API issue (predict_action_chunk takes a scalar inference_delay; batched clients have different delays) — behind a Scheduler seam so it can land later without redesign (§8.5).
P11 — Discovery ≠ multicast.
Zenoh's multicast scouting does not cross WAN, NAT, or most k8s CNIs. Consequence: multicast scouting disabled; clients use static connect.endpoints (DNS name of the router) + gossip; presence and liveness come from Zenoh liveliness tokens (§6.4), not discovery. "Discovery" for a robot fleet is configuration.
5. System Topology
(Diagram unchanged from v1 — the topology survives; transport/QoS/session details in it are superseded by §6.)
- Router tier: one or more
zenohdrouters (k8s Deployment + Service, TLS on 7447). Robots dial out to the router (NAT-friendly: labs only need outbound 7447/443). GPU servers join as peers via cluster DNS. - Server: one process = one
(model_repo, revision, dtype, device)on one GPU, pre-warmed from a YAML manifest (KEPT from v1, amended:pin_task: bool— VLA prompts may vary per session unless pinned). - Client: one robot running
lerobot-rollout --inference.type=remote. Weightless: config-only policy metadata. - Identity:
client_uuidper robot;session_idper connection epoch; both in every log line on both sides.
6. Zenoh Design
All Zenoh claims below were verified against zenoh / zenoh-python 1.x (eclipse-zenoh 1.9.0). Pin: eclipse-zenoh>=1.9,<2.0; keep zenohd on the same minor as the Python binding. Wheels cover manylinux x86_64/aarch64/armv7l/armv6l + macOS — Raspberry Pi edge clients are covered.
6.1 Key-expression schema
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/obs client → server
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/action server → client
@lerobot/<model_id>/<revision>/<task_slug>/status queryable (capabilities)
@lerobot/<model_id>/<revision>/<task_slug>/session queryable (open/validate)
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/reset queryable (episode boundary)
@lerobot/<model_id>/<revision>/<task_slug>/<client_uuid>/alive liveliness token (client)
@lerobot/<model_id>/<revision>/<task_slug>/server/alive liveliness token (server)
Rules (hard, enforced by a sanitize_keyexpr() helper):
- Root at the verbatim chunk
@lerobot— verbatim chunks are only matched by identical chunks, so third-party**subscribers on a shared router can never scrape the tree. - Sanitize every user-supplied segment (model ids, task strings, uuids): non-empty, no
* $ ? # /, no leading/trailing/double/. A task string containing/must be slugified before it becomes a key chunk. - Server subscribes with a single-depth wildcard (
.../*/obs) — never**(it would also matchstatus,alive, …). - v1's
cluster/experimentprefix segments are dropped from the key schema; they return as free-formtagsmetadata in the session handshake (telemetry/labeling, not routing). Routing topology belongs to deployment (which router you dial), not to key depth.
6.2 Data plane vs. control plane (the rmw_zenoh split)
- Data plane = pub/sub (KEPT from v1): observations up, action chunks down, correlated by
seq_idin attachments (§10.4). Pub/sub rather than query-per-inference because: a timed-out query's late reply is dropped by the transport (wasted inference), whereas a late pub/sub chunk is still mergeable if it answers the latest outstanding seq; and pub/sub leaves room for server-initiated messages (drain notices). The one-in-flight discipline (P5) is enforced in the client worker, not by the transport. - Control plane = queryables (request/reply with explicit timeouts; the pattern rmwzenoh uses for ROS 2 services):
status(pre-flight capability fetch, 2 s timeout),session(open/validate → ack with capabilities +session_id),reset(episode boundary — _acknowledged, so episodic strategies know the server-side episode state is clean). Always pass an explicittimeouttosession.get()— the config default is 10 s, far too long for our watchdogs. - Episode ordering: under one-in-flight there is no obs/reset race window in the data plane, but as belt-and-braces the first observation of each episode also carries
episode_start=True+ the newepisode_idin its header.
6.3 QoS (revised from v1 §6.2 — see P6)
| Topic | reliability | congestion_control | express | priority | Why |
|---|---|---|---|---|---|
obs |
default | DROP | false | DATA | Intentional drop already happened at the client's one-slot holder; if the uplink stalls, dropping a frame protects the control loop. |
action |
RELIABLE | DROP (never BLOCK) | true | INTERACTIVE_HIGH | Hop-level reliability over TCP; express skips batching for the small (4–50 KB) latency-critical payload; DROP so one dead robot uplink can never stall the server's publish path. Chunk loss is recoverable: the client buffer rides through it. |
| control queryables | RELIABLE | default | — | — | Correctness over latency; explicit timeouts bound them. |
Upgrade path if WAN chunk loss proves material: AdvancedPublisher/AdvancedSubscriber (zenoh ≥ 1.5) with a small cache + heartbeat-based recovery on the action topic only. Hop-by-hop RELIABLE is not end-to-end reliability — Zenoh has no broker persistence; a disconnected subscriber's data is gone. The design assumes this (client state machine, §9).
6.4 Liveliness (presence + watchdogs)
- Client declares a liveliness token on
.../<client_uuid>/alive. The server liveliness-subscribes withhistory=True: token appear → ensure session state; token drop → GC the session (mailbox, processor instances) after a grace period. - Server declares
.../server/alive. The client liveliness-subscribes: on drop → treat as RECONNECTING (§9), hold/fallback per config, re-run thestatus/sessionhandshake when the token reappears. - Tune the transport lease down from its default so ungraceful-death detection is seconds, not tens of seconds (verify the default in the pinned version; it is config
transport/link/tx/lease). - Liveliness cannot detect a hung-but-connected server. The client's per-request timeout (
request_timeout_s) is the authoritative watchdog — this is the structural fix for legacy BUG-3 (no deadlines onGetActions).
6.5 Threading constraints (zenoh-python facts that shape both processes)
- No asyncio API in zenoh-python — both client and server are thread-based. This matches the existing RTC engine pattern exactly.
- Each callback-based subscriber spawns a dedicated Python thread; blocking Zenoh calls inside callbacks are disallowed. Callbacks must be deposit-only (write a slot, set an event, return).
- Channel handlers (
FifoChannel,RingChannel) are Rust-side;try_recv()polls without spawning Python threads.RingChannel(1)is native latest-only semantics. - No zero-copy path for our payloads (SHM API is
@_unstableand same-host-only;ZBytescopy behavior undocumented). At ~200 KB × a few Hz per robot, one memcpy is irrelevant.
6.6 Router deployment
zenohdofficial image as a k8s Deployment (1–N replicas; routers mesh and reroute around failures) behind aLoadBalancer/NodePortService exposing TLS 7447. No official Helm chart exists — roll-your-own manifests.scouting.multicast.enabled: false;scouting.gossip.enabled: true; clients/servers use staticconnect.endpoints.- Auth: mTLS per robot (
transport.link.tlswithenable_mtls) + router ACL keyed oncert_common_names: a robot's cert may onlyputto@lerobot/**/<its-uuid>/obsand receive on.../<its-uuid>/action. Caveat (flagged): ACL config reloads require a router restart — plan cert/ACL changes as rolling router restarts. - Security review input: the third-party Zenoh protocol security analysis (Census Labs, 2025) should be read before exposing 7447 publicly.
7. The Statelessness Boundary (the load-bearing section)
Where the network cut goes. The local RTC pipeline is:
obs (robot-processed dict)
→ build_dataset_frame(hw_features, obs, "observation") CLIENT (cheap, hardware-coupled)
─────────────────────────── network ───────────────────────────
→ prepare_observation_for_inference(...) SERVER (policy-coupled, heavy)
→ per-session preprocessor(...) SERVER (stateful within the request)
→ policy.predict_action_chunk(obs, inference_delay, prefix) SERVER (pure for allowlisted policies)
→ per-session postprocessor(...) SERVER (reads state cached at preprocess)
─────────────────────────── network ───────────────────────────
→ ActionQueue.merge(original, processed, real_delay, idx_before) CLIENT
Three consequences:
- The server needs no cross-request state.
RelativeActionsProcessorStepwrites_last_stateat preprocess and the postprocessor reads it back within the same request. Per-session pipeline instances + one-request-at-a-time-per-session give correctness with zero persistent state. - RTC state stays client-side, exactly where
RTCInferenceEnginealready keeps it. Each request ships:inference_delay_steps = ceil(L_max/dt)(from the clientLatencyTracker, whose samples are full network-inclusive cycle times — RTT compensation falls out for free),prefix_model = queue.get_left_over()[:H], andprefix_robot = queue.get_processed_left_over()[:H](needed for server-side relative-prefix re-anchoring, mirroringrtc.py:287-305). The response returns both the model-space and robot-space chunks becausemergeneeds both. ≤execution_horizon × action_dimfloat32 each — a few hundred bytes. - G9 dies structurally. No bespoke client resize (
F.interpolatein legacyhelpers.py), no client-side normalization. Clients ship native camera resolution; the server's canonical processor path does everything — serve-time preprocessing is byte-identical to train-time.
What the server does hold (and what it means):
- Per-session processor instances (cheap; normalization stat tensors shared read-only).
- Per-session episode counter + stats. Episode reset = reset the session's pipelines, clear its mailbox.
policy.reset()is never called in shared mode — it is global to the shared policy instance and unnecessary for chunk-pure policies (ACT's ensembler and Pi0/SmolVLA's queues live inselect_action, notpredict_action_chunk— verified). - Policies that are not chunk-pure get
serving_mode: exclusive(§8.3).
8. The Inference Server: lerobot-policy-server
New package src/lerobot/policy_server/; console script lerobot-policy-server --manifest manifest.yaml.
8.1 Process model — KEPT from v1, amended
One process = one model+task on one GPU, loaded and warmed at startup (warmup_inferences dummy forwards; covers torch.compile). Multi-GPU nodes run N processes (CUDA_VISIBLE_DEVICES pinning). Dynamic model loading (SendPolicyInstructions) is rejected: pickle/RCE surface, arbitrary-download surface, and it destroys capacity planning. Amendment: pin_task: false (default) lets VLA clients set the task per session; pin_task: true rejects mismatched tasks at session open.
8.2 Concurrency (pure threads — no asyncio in zenoh-python)
zenoh subscriber (.../*/obs) inference worker (1 thread, owns GPU)
deposit-only callback: loop:
slots[client_uuid] = sample ──► pick next session with pending obs (RR ring)
(per-client latest-only) decode JPEG → per-session preprocess
predict_action_chunk(delay, prefix)
control queryables (status/session/ per-session postprocess → encode
reset): validate, mutate session publisher.put(.../<uuid>/action)
registry, reply (publishing from the worker thread is fine)
- Per-client latest-only mailbox: a wildcard subscriber with a deposit-only callback writing per-client slots (scales to dynamic fleets), or — when the manifest enumerates clients — one
RingChannel(1)subscriber per client polled viatry_recv(). Either way: newest observation wins; a superseded request is counted (superseded_seqsin the next response) so drops are visible. This deletes legacy BUG-4 (observations_similar+must_go) by construction — the client decides when to request; the server never second-guesses observation content. - Single inference worker: torch releases the GIL inside
forward, callbacks stay responsive. Strict round-robin over sessions with pending observations: each gets exactly one inference per cycle; starvation is structurally impossible. Overload degrades into longer cycle times → larger (but correct) clientdelay_steps→ eventually the client staleness bound trips and the robot holds — safe by construction.
8.3 Chunk-stateless allowlist and serving modes
At startup the server classifies the loaded policy:
| Class | Policies (verified) | Mode |
|---|---|---|
| chunk-stateless | ACT, Pi0, Pi0.5, SmolVLA (and any policy whose predict_action_chunk touches no instance state) |
shared: N sessions, per-session pipelines, policy.reset() never called |
| chunk-stateful | Diffusion family (predict_action_chunk reads select_action-fed self._queues) |
exclusive: max_sessions=1 enforced; episode reset additionally calls policy.reset(); second session open → rejected with a self-explanatory error |
| no chunk API | SAC, SARM | refused at startup |
Implemented as a registry in policy_server/validation.py; the cleaner follow-up is a supports_stateless_chunking class attribute on PreTrainedPolicy (needs a pass over policy families — roadmap §14).
8.4 Session open & capability validation (fail fast, fail loud)
session queryable payload: client_uuid, policy_type, fps, feature summary (post-rename observation feature names + shapes, ordered action keys), schema_version, RTC intent, tags. Checks:
| Check | Rule | On mismatch |
|---|---|---|
| Action names and order | must equal server's action_feature_names exactly |
hard reject — this is the sync-safety contract mapping chunk columns to motors |
| Camera names | client set must cover policy.config.input_features image keys |
hard reject |
| Resolution | any H×W accepted (server resizes canonically) | warn if aspect ratio differs from training |
| State dim | flattened dim must match | hard reject |
schema_version |
client within server's supported range | hard reject |
| fps | vs. manifest trained_fps |
warn (reject only when strict_fps: true) |
| Task | when pin_task: true, must equal default_task |
reject |
| RTC | client RTC requires policy RTC kwargs support | downgrade to append mode + warning |
| Capacity | active_sessions < max_sessions |
reject with current load → client retries another replica |
Reply: session_id, model info (repo, revision — consider a checkpoint hash, §15), action_feature_names, chunk_size, trained_fps, supports_rtc, serving_mode, warmed_up, schema_version, warnings. rename_map is applied client-side so the wire format is canonical policy-feature keys across heterogeneous robots (also a prerequisite for future batching).
8.5 Scheduler seam (micro-batching later, not in v1)
The worker calls a Scheduler.select(ready: list[Session]) -> list[Session]; v1 ships RoundRobin (return ready[:1]). Cross-session batching is blocked on the policy API (inference_delay is scalar; batched clients have different delays/prefixes) — when that lands, a MicroBatch scheduler groups same-shape sessions. The seam costs nothing now and prevents a redesign later.
8.6 Manifest
model:
{
repo_or_path: lerobot/pi0_towels,
revision: main,
dtype: bfloat16,
device: cuda,
}
default_task: "fold the towel"
pin_task: false
serving_mode: shared # forced to exclusive for chunk-stateful policies
max_sessions: 5 # from the §P10 formula: Pi0 @150ms, 1 Hz refresh
warmup_inferences: 2
strict_fps: false
zenoh:
connect_endpoints: ["tls/router.gpu-cluster.internal:7447"]
tls:
{
connect_certificate: ...,
connect_private_key: ...,
root_ca_certificate: ...,
}
health_port: 9100 # HTTP health + Prometheus metrics
debug: { capture_dir: null, capture_max: 256 }
Draccus dataclass in policy_server/manifest.py; YAML via --manifest, individual overrides via CLI.
9. The Edge Client: RemoteInferenceEngine
New file src/lerobot/rollout/inference/remote.py, registered @InferenceEngineConfig.register_subclass("remote").
9.1 Threading model
| Thread | Role |
|---|---|
| Main (strategy loop) | notify_observation(obs) → lock-protected latest-only slot (identical to rtc.py _obs_holder). get_action() → ActionQueue.get() + staleness check. Never any I/O. Structurally fixes legacy BUG-1 (blocking send inside the 33 ms loop). |
| Network worker (1 daemon thread) | Cycle: wait until queue_remaining·dt ≤ buffer_time_s and active → snapshot idx_before, prefixes, delay_steps = ceil(L_max/dt) → encode (JPEG q=jpeg_quality) → publisher.put(obs, attachment=header) → await chunk on the action subscriber channel (timeout request_timeout_s) → merge(original, processed, ceil(L/dt), idx_before) → latency_tracker.add(L). Owns the state machine, reconnects, and control queries. One-in-flight (P5). |
| Zenoh action subscriber | FifoChannel(2) handler drained by the worker (no Python callback thread on the hot path); liveliness subscriber callback is deposit-only (sets an event). |
Reused unchanged: ActionQueue (policies/rtc/action_queue.py), LatencyTracker, ActionInterpolator (lives in strategies — interpolation_multiplier works with remote for free). Deleted concepts: aggregation zoo, observations_similar, must_go, TimedObservation/TimedAction pickles.
9.2 Fail-safe state machine
ok no chunk for degraded_after_s
CONNECTING ─────► STREAMING ───────────────────────────────► DEGRADED
│ ▲ ▲ │ queue empty OR max_action_age_s hit │
│ │ backoff, │ └───────────────────────────────────► STALLED ◄──┘
│ │ re-handshake │ first successful merge │
│ └─ RECONNECTING ◄── timeout streak / server liveliness drop ◄─┘
│ │ offline > max_offline_s, capability/schema mismatch, auth failure
└──────► DEAD (failed=True → shutdown_event → strategy teardown: return-to-initial-pose)
- DEGRADED: requests failing but the queue still holds actions — the robot keeps executing; chunks are the fault-tolerance buffer (1–3 s of coverage makes blips and clean server drains invisible).
- STALLED: queue empty or staleness bound hit → apply
fallback:hold(get_action→None;send_next_actionalready tolerates it),repeat_last, orzero(required for velocity-controlled robots, where "send nothing" means "keep last velocity"). - Staleness bound (sync safety): every merge records
(chunk_start_index, t_send);get_actionrefuses any action whose source observation is older thanmax_action_age_s(default 3.0 s ≈ 90 steps @ 30 fps). Bounds open-loop execution after a network stall. - DEAD: only after
max_offline_s(default 60 s) or a hard contract violation (capability/schema mismatch on reconnect — e.g. the server restarted with a different model; never execute wrong-model chunks). Uses the exact mechanism RTC uses (failed=True+ globalshutdown_event) so existing teardown runs unchanged. - Watchdog layering: per-request timeout (hung server — the BUG-3 fix) → server liveliness token (dead server/router) → staleness bound (the robot-side invariant that holds regardless of why data stopped).
- Pause/resume (DAgger):
pause()stops the worker publishing (slot keeps refreshing, ignored); queue intact — parity withRTCInferenceEngine.pause. DAgger's existinginterpolator.reset(); engine.reset(); engine.resume()sequence works unchanged. reset()(episode boundary): clearActionQueue+ staleness bookkeeping, bumpepisode_id, fire the ackedresetquery (1 s timeout, failure logged — the server has nothing it must do thanks to per-request statelessness), flagepisode_starton the next observation.LatencyTrackerintentionally survives reset (latency is episode-invariant; parity with local RTC).ready= session opened ∧ capabilities validated ∧ serverwarmed_up. First-chunk gating is implicit (get_action→Noneuntil the first merge).
9.3 Weightless client — exact integration changes
rollout/context.py:PolicyContext.{policy, preprocessor, postprocessor}become| None. For remote configs, skip step 1 (weight load / PEFT /.to(device)/ torch.compile /init_rtc_processor) and step 6 (make_pre_post_processors). Verified safe: strategies only consumectx.policy.inference. Keep steps 2–5 (robot processors, hardware, features, dataset) — they are robot-derived. Keep the visual pre-flight check (context.py:309-324):--policy.pathalready loads config-only (rollout/configs.py:324-328, no weight download) and failing before dialing the server is free.use_torch_compile/ explicit--device→ warn-and-ignore for remote.rollout/inference/factory.py: signature loosens topolicy: PreTrainedPolicy | None(+policy_config: PreTrainedConfig);sync/rtcbranches guardpolicy is None; theremotebranch lazy-imports (eclipse-zenohstays an optional extra).- The authoritative validation moves to session open (§8.4); the local check becomes a fast-fail convenience.
9.4 Config
@InferenceEngineConfig.register_subclass("remote")
@dataclass
class RemoteInferenceConfig(InferenceEngineConfig):
connect_endpoint: str = "tls/localhost:7447" # zenoh router endpoint
tls_cert: str | None = None; tls_key: str | None = None; tls_ca: str | None = None
client_uuid: str = "" # "" → uuid4 at start()
jpeg_quality: int = 90 # 0 = raw (LAN/debug)
buffer_time_s: float = 0.5 # send next obs when queue playback ≤ this (v1 G14) — KEPT
max_action_age_s: float = 3.0 # staleness bound (safety)
degraded_after_s: float = 1.0
request_timeout_s: float = 5.0
reconnect_initial_backoff_s: float = 0.5
reconnect_max_backoff_s: float = 10.0
max_offline_s: float = 60.0
fallback: FallbackBehavior = FallbackBehavior.HOLD # hold | repeat_last | zero
rtc: RTCConfig = field(default_factory=RTCConfig) # enabled → replace mode; horizon caps prefix
tags: dict[str, str] = field(default_factory=dict) # ex-cluster/experiment labels
# Remote RTC + sentry recording (the reproducibility path)
lerobot-rollout \
--strategy.type=sentry \
--policy.path=lerobot/pi0_towels \ # config-only: no weights downloaded
--inference.type=remote \
--inference.connect_endpoint=tls/router.gpu-cluster.internal:7447 \
--inference.rtc.execution_horizon=10 \
--robot.type=so100_follower --robot.port=/dev/ttyACM0 \
--robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=user/rollout_fleet_a --dataset.single_task="fold the towel"
10. Wire Schema
10.1 Payload anatomy & rates — KEPT (JPEG) with numbers
Upstream per request: joints (24–128 B) + JPEG frames (480p q90 ≈ 40–90 KB each; 720p ≈ 110–230 KB) + RTC prefixes (≤ a few KB) → 60–450 KB depending on cameras. Downstream: 2 × chunk_size × action_dim × 4 B + metadata → 3–50 KB. Effective request rate is self-clocked by buffer_time_s to ~1–4 Hz per robot (not the 30 Hz control rate). 300 robots ≈ 0.3–10 Mbps each — the wire is never the bottleneck; bandwidth budgeting is about camera count/resolution, and each GPU pod only ever sees its own ≤ max_sessions clients. Zenoh fragments >64 KiB payloads transparently; multi-MB messages are fine.
10.2 Attachment header (fixed-layout, packed little-endian — parsed without touching the body)
| Field | Type | Notes |
|---|---|---|
schema_version |
u16 | negotiated at session open |
msg_type |
u8 | OBS / CHUNK / EVENT |
seq_id |
u64 | per-session monotonic; echoed in the chunk |
episode_id |
u32 | bumped by reset() |
client_mono_ns |
i64 | client monotonic_ns(); opaque to the server, echoed back |
session_epoch |
u32 | bumped per (re)connect; stale-epoch chunks dropped |
10.3 msgpack bodies
ObservationMsg (client → server): state: {names_ref, data: f32 LE bytes}, images: {name: {codec: jpeg|raw, bytes, (h,w,c) if raw}}, task: str, inference_delay_steps: int, prefix_model: tensor?, prefix_robot: tensor? (tensors = raw LE bytes + dtype + shape), episode_start: bool.
ActionChunkMsg (server → client): seq_id_echo, client_mono_ns_echo, chunk_model: tensor, chunk_robot: tensor, queue_wait_ms: f32, inference_ms: f32, superseded_seqs: u32, server_load: f32.
Status / SessionOpen / SessionAck / ResetMsg: as specified in §8.4.
10.4 Schema discipline (P7)
schema_version gates at handshake; evolution is additive-only (new optional msgpack keys; unknown keys ignored); attachment layout changes require a version bump; golden codec round-trip tests (tensor exactness, JPEG RGB-channel-order regression — a silent BGR swap poisons every VLA in the fleet) are part of the test suite. No pickle anywhere — KEPT from v1 and now structural: nothing in the schema can carry code.
11. Latency Budget & the Clock Iron Rule
| Stage | LAN | WAN (50 ms RTT) |
|---|---|---|
| JPEG encode ×3 (edge CPU) | 2–9 ms | 2–9 ms |
| Serialize | <1 ms | <1 ms |
| Uplink (tx + ½RTT) | ~2 ms | ~54 ms |
| Server queue wait | 0 → 1×inference | 0 → 1×inference |
| Decode + canonical preprocess | 4–10 ms | 4–10 ms |
| Inference | 15–150 ms | 15–150 ms |
| Postprocess + downlink + merge | ~2 ms | ~27 ms |
| Total (Pi0-class) | ~110–175 ms | ~190–250 ms |
Inference is 60–85 % of end-to-end on LAN; the entire transport+serialization stack is <10 ms. WAN adds propagation + uplink bandwidth — identical under any transport. At 30 fps this lands delay_steps ≈ 4–8, comfortably inside RTC execution horizons: WAN degrades smoothness parameters, never correctness. This table is the standing answer to transport-performance bikeshedding.
Clock iron rule (P4): wall-clock instants never cross machines. Client stamps monotonic_ns, the server echoes it opaquely; RTT = now − echo. The server reports only durations (queue_wait_ms, inference_ms) measured on its own monotonic clock; network_time = RTT − queue_wait − inference for diagnostics. The schema has no field in which a foreign wall-clock instant can be compared — the legacy time.time() bug is unrepresentable.
12. Reproducibility & Audit (P8)
The contract is fully logged + replayable, not "deterministic":
- Client = source of truth. Recording strategies already persist observations + executed actions to
LeRobotDataset. The remote engine logs, per executed action, the(session_id, seq_id, episode_id)of its source chunk plus the echoedqueue_wait_ms/inference_ms(dataset-extras columns are a follow-up; client logs in v1). - Server audit line per request (structured JSON):
{ts, session_id, client_uuid, seq_id, episode_id, queue_wait_ms, inference_ms, chunk_range, superseded_seqs, outcome}. - Optional bounded capture:
debug.capture_dirwrites a ring of request/response pairs (safetensors) for byte-exact offline replay through the same server pipeline. - Runbook — "robot #217 stuttered at 14:03": (1) Grafana
session_staleness{client="217"}— spike ⇒ server side, flat ⇒ client/network. (2) Server side: audit lines —queue_wait_msrising across all sessions ⇒ overloaded replica (checkactive_sessionsvsmax_sessions);superseded_seqsstreak on 217 only ⇒ that client over-requesting;outcome=error⇒ adjacent stack trace. (3) Client side: state-machine transitions + reconnects in the client log; dataset rows show which seq's chunk was executing and whereNoneticks occurred. Every hop shares(session_id, seq_id)— the join is mechanical.
13. Integration & Migration Plan
13.1 New
| Path | Content |
|---|---|
src/lerobot/policy_server/{__init__,schema,codec,manifest,session,scheduler,validation,server}.py |
wire schema constants, msgpack/attachment codecs, manifest dataclasses, Session + mailbox, Scheduler seam, capability rules + chunk-stateless registry, zenoh servicer + inference worker + drain + HTTP health/metrics |
src/lerobot/rollout/inference/remote.py |
RemoteInferenceEngine (~600 lines; mirrors rtc.py structure) |
src/lerobot/scripts/lerobot_policy_server.py + [project.scripts] entry |
thin main() |
docker/Dockerfile.policy-server |
CUDA runtime base + uv; manifest via ConfigMap |
docs/source/remote_inference.mdx (+ _toctree.yml) |
replaces async.mdx |
13.2 Modified
rollout/inference/factory.py (config + Optional-typed signature + lazy import) · rollout/context.py (weightless branch) · rollout/inference/__init__.py · scripts/lerobot_rollout.py docstring · pyproject.toml: [async] extra becomes eclipse-zenoh>=1.9,<2.0 + msgpack (grpcio/matplotlib leave it; grpcio remains under [hilserl]/dev for the RL stack).
13.3 Removed — same landing PR
src/lerobot/async_inference/ · tests/async_inference/ · docs/source/async.mdx + its _toctree.yml entry · the AsyncInference service + Observation/Actions/PolicySetup messages from src/lerobot/transport/services.proto (regenerate pb2; LearnerService untouched — transport/ is shared with HIL-SERL (src/lerobot/rl/); the RL test suite gates this change).
13.4 Legacy config → successor mapping
Legacy (RobotClientConfig/PolicyServerConfig) |
Successor |
|---|---|
server_address |
--inference.connect_endpoint (zenoh router) |
policy_type, pretrained_name_or_path |
--policy.path (config-only) + server manifest |
chunk_size_threshold (0–1 ratio) |
--inference.buffer_time_s (seconds) |
actions_per_chunk |
server manifest (validated at session open) |
aggregate_fn_name + AGGREGATE_FUNCTIONS |
dropped — ActionQueue replace/append |
policy_device, client_device |
dropped — server concern / chunks arrive CPU f32 |
debug_visualize_queue_size |
dropped — Rerun (--display_data) + engine stats |
PolicyServerConfig.{host,port} |
manifest zenoh.connect_endpoints |
inference_latency, obs_queue_timeout |
dropped — latency client-measured; no server obs queue |
SendPolicyInstructions |
dropped — MaaS manifest + session validation |
observations_similar / must_go |
dropped — latest-only slots + client send gate |
| pickle envelopes | dropped — msgpack + attachment headers |
13.5 Legacy bugs/gaps → structural resolution
BUG-1 → worker thread owns all I/O. BUG-2 → aggregation deleted; ActionQueue is internally locked. BUG-3 → per-request timeout + liveliness. BUG-4 → client-side send gating; server newest-wins. G1 → per-session registry. G2 → manifest. G4 → msgpack+attachments. G5 → monotonic echo + delay_steps. G7 → recording strategies. G8 → mTLS + ACL. G9 → server-side canonical processors. G11 → status queryable. G12 → Prometheus + audit logs. G13 → lerobot-policy-server console script. G14 → buffer_time_s.
13.6 Tests
- Unit: codec round-trips (tensor exact; JPEG RGB-order regression), capability-validation matrix (§8.4 as parametrized cases), scheduler fairness + newest-wins supersession (mock policy with configurable sleep), manifest parsing, key-expr sanitization.
- Loopback integration (CPU, fast CI): client+server in one process over zenoh peer-to-peer (or a localhost
zenohdstarted by the fixture), tiny-ACT, fake 2-camera robot, N=8 concurrent sessions. The headline regression: two sessions with different joint states must not cross-contaminateRelativeActionsProcessorSteppostprocessing — the test that proves the multi-tenancy claim. - Chaos: kill the server mid-episode → client returns
None, never raises into the control loop,failedstays False withinmax_offline_s, resumes on restart;docker kill zenohd→ liveliness flap → safe state → re-handshake (explicitly tests re-declaration behavior, flagged unverified upstream); SIGTERM drain → in-flight chunk completes, clients reconnect invisibly. - Golden parity: remote RTC vs local
RTCInferenceEngineon identical observation sequences → byte-identical merged queues (the re-anchoring contract test). Gate for any real-robot remote-RTC use.
14. Roadmap
- PR1 — schema & codecs (no torch deps):
policy_server/{schema,codec,manifest}.py, key-expr sanitizer, golden codec tests. - PR2 — server core: session registry, scheduler, validation/allowlist, inference worker with mock policy, loopback harness.
- PR3 — client engine:
RemoteInferenceEngine, factory/context weightless integration, loopback integration + chaos + golden-parity tests. - PR4 — ops & docs: Dockerfile, health/metrics, drain, ACL examples,
remote_inference.mdx, rollout docstring. - Landing PR — legacy deletion: remove
async_inference/+ tests + docs + proto service (RL suite gates),[async]extra swap. - Pre-release field validation: one real robot on a lossy network (watchdog default tuning); JPEG q90 vs raw A/B on one policy (train/serve shift).
- Future: micro-batching (needs per-sample
inference_delayacross policy families), client-side downscale-to-policy-resolution (config-only shapes make it possible), Advanced Pub/Sub on the action topic, per-robot quotas, dataset provenance columns,supports_stateless_chunkingattribute upstreamed to policy classes.
15. Open Risks
| Risk | Mitigation / decision needed |
|---|---|
Re-anchoring parity (server-side relative-prefix re-anchor vs rtc.py) |
Golden parity test (§13.6) is a hard gate before robot use; likely failure mode is normalizer dtype/device drift |
First-chunk over-trim when idle: merge trims ceil(L/dt) even when nothing was consumed (queue empty at episode start) — wasteful at network latencies (600 ms ⇒ 18 steps) |
Proposed clamp real_delay = min(real_delay, last_index - idx_before) touches the shared ActionQueue used by local RTC — needs sign-off + regression tests |
| JPEG train/serve distribution shift | Unmeasured; A/B before locking q90 default (roadmap §14.6) |
Watchdog defaults untuned (request_timeout_s=5, degraded_after_s=1, max_action_age_s=3) |
Field validation on wired and Wi-Fi; consider named profiles |
| Capability check can pass while semantics differ (different finetune, different normalization stats, identical feature names) | Add checkpoint hash/revision pinning to SessionAck — decide in PR2 |
| zenoh-python long-session maturity: re-declaration after router restart partially verified; SHM unstable; no asyncio | Chaos tests own this; thread-based design avoids the asyncio gap entirely |
| Router ACL reload requires restart | Operational runbook: cert/ACL changes = rolling router restart |
fallback=zero has no consumer until velocity actions land in rollout (only .pos features routed today) |
Validate the enum against robot capabilities when velocity support lands |
| Per-client mailbox memory under fleet-scale wildcard subscription | One decoded-obs slot per client is small; add an LRU GC tied to liveliness drops |