mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-30 06:37:15 +00:00
docs(jobs): document the model-pushed marker contract and orphaned repos
Follow-up to the claude[bot] review on #3856 (non-blocking observations): - Cross-reference the "Model pushed to <url>" log line between its producer (PreTrainedPolicy.push_model_to_hub) and the remote-run consumer in submit_to_hf, noting the contract is an early-finish optimization that falls back to status polling if it drifts. - Note in the HF Jobs guide that a failed remote run leaves its model repo on the Hub (it is not auto-deleted) and how to remove it.
This commit is contained in:
@@ -626,6 +626,8 @@ Every job (and any dataset pushed by the run) is tagged `lerobot` so it's easy t
|
||||
|
||||
By default the job runs until training finishes, with no time limit. Cap it with an HF Jobs duration string if you want a hard ceiling, e.g. `--job.timeout=4h`.
|
||||
|
||||
> **Note:** the model repo is created up front (it holds the staged training config the job runs from). If a run fails before the model is pushed, that repo is left on the Hub so you can inspect it — it is not deleted automatically, so repeated failures can leave empty repos behind. Remove one with `hf repo delete <repo-id>`.
|
||||
|
||||
**Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine — the key is forwarded to the job automatically.
|
||||
|
||||
#### Upload policy checkpoints
|
||||
|
||||
@@ -300,7 +300,10 @@ def submit_to_hf(cfg: TrainPipelineConfig) -> None:
|
||||
poll_thread = threading.Thread(target=_poll, daemon=True)
|
||||
poll_thread.start()
|
||||
# Finish as soon as the model is pushed, rather than waiting out the platform's
|
||||
# post-run finalization before the job stage flips to COMPLETED.
|
||||
# post-run finalization before the job stage flips to COMPLETED. This matches the
|
||||
# exact log line emitted by PreTrainedPolicy.push_model_to_hub — the two must stay
|
||||
# in sync. If it ever stops matching we just fall back to stage-based completion
|
||||
# (~30s slower), so the contract is an optimization, not a correctness requirement.
|
||||
success_marker = f"Model pushed to https://huggingface.co/{repo_id}"
|
||||
log_thread = threading.Thread(
|
||||
target=_tail_logs, args=(job_id, done, success_marker, pushed_ok), daemon=True
|
||||
|
||||
@@ -340,6 +340,9 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
|
||||
ignore_patterns=["*.tmp", "*.log"],
|
||||
)
|
||||
|
||||
# Contract: lerobot.jobs.hf.submit_to_hf watches for this exact
|
||||
# "Model pushed to <url>" line to end a remote run early. Keep the wording
|
||||
# and URL format in sync (it falls back to status polling if they drift).
|
||||
logging.info(f"Model pushed to {commit_info.repo_url.url}")
|
||||
|
||||
def generate_model_card(
|
||||
|
||||
Reference in New Issue
Block a user