fix(jobs): default remote job timeout to 2d instead of the platform default

HF Jobs applies its own short 30-minute timeout when none is sent, which silently kills long training runs. Pass an explicit, generous 2d cap by default; users can still set --job.timeout to a shorter value to fail fast or to a longer one to extend the run.
2026-06-30 14:47:10 +00:00 · 2026-06-25 21:23:01 +02:00
parent b686c35bc3
commit 77e4ea2afa
3 changed files with 5 additions and 4 deletions
@@ -630,7 +630,7 @@ If your dataset exists only locally (not yet on the Hub), it is automatically pu

 Every job (and any dataset pushed by the run) is tagged `lerobot` so it's easy to find on the Hub. Add your own with `--job.tags '["my-tag"]'`.

-By default the job runs until training finishes, with no time limit. Cap it with an HF Jobs duration string if you want a hard ceiling, e.g. `--job.timeout=4h`.
+By default the job is capped at `2d` (48h) of wall-clock. Override it with an HF Jobs duration string, e.g. `--job.timeout=4h` to fail faster or `--job.timeout=7d` for a longer run.

 > **Note:** the model repo is created up front (it holds the staged training config the job runs from). If a run fails before the model is pushed, that repo is left on the Hub so you can inspect it — it is not deleted automatically, so repeated failures can leave empty repos behind. Remove one with `hf repo delete <repo-id>`.
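
The default-and-override behavior described in this change could be sketched as below. This is a minimal illustration, not the code from this commit: `DEFAULT_JOB_TIMEOUT`, `resolve_timeout`, and `timeout_to_seconds` are hypothetical names, and the set of unit suffixes (`s`, `m`, `h`, `d`, with a bare number read as seconds) is an assumption based on the duration strings shown above.

```python
import re

# Hypothetical constant mirroring the new default described in this commit.
DEFAULT_JOB_TIMEOUT = "2d"

# Assumed unit suffixes for duration strings; a bare number is read as seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def resolve_timeout(user_timeout=None):
    """Return the user's timeout if one was given, else the 2d default.

    Always returning an explicit value means the platform's own short
    default is never applied implicitly.
    """
    return user_timeout if user_timeout is not None else DEFAULT_JOB_TIMEOUT


def timeout_to_seconds(timeout):
    """Convert a duration such as '4h', '7d', or 90 into seconds."""
    if isinstance(timeout, (int, float)):
        return float(timeout)
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([smhd]?)\s*", timeout)
    if match is None:
        raise ValueError(f"Unrecognized duration string: {timeout!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit or "s"]


if __name__ == "__main__":
    for given in (None, "4h", "7d"):
        effective = resolve_timeout(given)
        print(given, "->", effective, "=", timeout_to_seconds(effective), "seconds")
```

Keeping the default as a duration string rather than a raw number of seconds keeps it in the same form users type on the command line, so the value shown in the docs and the value sent to the job stay identical.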