* refactor(training): rename eval_freq to env_eval_freq
- Rename eval_freq to env_eval_freq to distinguish sim environment evaluation from offline loss evaluation.
* feat(training): add inline offline validation with train/eval split
- Add eval_split config for balanced per-task holdout
- Add eval_steps for periodic inline eval loss computation
- Add max_eval_samples to cap eval cost
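A minimal sketch of what a balanced per-task holdout with a sample cap could look like. The function names, the `episode_tasks` input format, and the seeding are assumptions made for illustration; they are not taken from the actual implementation.

```python
import random
from collections import defaultdict


def split_episodes_per_task(episode_tasks, eval_split, seed=0):
    """Hold out roughly `eval_split` of the episodes of every task.

    `episode_tasks` maps episode index -> task name (hypothetical format).
    """
    by_task = defaultdict(list)
    for episode, task in sorted(episode_tasks.items()):
        by_task[task].append(episode)

    rng = random.Random(seed)
    train, held_out = [], []
    for task in sorted(by_task):
        episodes = by_task[task]
        rng.shuffle(episodes)
        # Cut per task, so every task contributes to the holdout in proportion.
        n_eval = round(len(episodes) * eval_split)
        # Never hold out every episode of a task.
        n_eval = min(n_eval, len(episodes) - 1)
        held_out.extend(episodes[:n_eval])
        train.extend(episodes[n_eval:])
    return sorted(train), sorted(held_out)


def cap_eval_samples(indices, max_eval_samples, seed=0):
    """Return at most `max_eval_samples` indices, sampled without replacement."""
    indices = list(indices)
    if max_eval_samples is None or len(indices) <= max_eval_samples:
        return indices
    return sorted(random.Random(seed).sample(indices, max_eval_samples))
```

Splitting per task rather than globally keeps rare tasks represented in the eval set, and the cap bounds the cost of each periodic eval pass.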
* fix(datasets): remap absolute indices in __getitem__ for filtered datasets
* fix(train): vectorize eval subset selection for max_eval_samples
* fix(datasets): move the remapping into EpisodeAwareSampler via absolute_to_relative_idx
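For illustration, an absolute-to-relative index map for a dataset filtered to a subset of episodes might be built as follows. Only the name `absolute_to_relative_idx` comes from the commit; the builder function and the `episode_ranges` format are hypothetical.

```python
def build_absolute_to_relative_idx(episode_ranges, kept_episodes):
    """Map absolute frame indices to positions in the filtered dataset.

    `episode_ranges` maps episode -> (start, end), a half-open range of
    absolute frame indices (hypothetical format).
    """
    mapping = {}
    relative = 0
    for episode in sorted(kept_episodes):
        start, end = episode_ranges[episode]
        for absolute in range(start, end):
            # Frames of dropped episodes get no entry, so kept frames are
            # numbered contiguously from zero.
            mapping[absolute] = relative
            relative += 1
    return mapping
```

A sampler that draws absolute indices can then translate each one through this map before indexing into the filtered dataset.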
* fix(validation): add eval_split range check and eval_steps warning
Validate eval_split is in [0.0, 1.0) to prevent garbage splits from
out-of-range values. Raise when eval_steps > 0 but eval_split is 0.0
since no offline eval will run.
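The checks described above could be sketched like this. `EvalSettings` is a hypothetical stand-in, not the project's config class, and the sketch follows the body text in raising (rather than warning) for the `eval_steps` case.

```python
from dataclasses import dataclass


@dataclass
class EvalSettings:  # hypothetical container for the two fields
    eval_split: float = 0.0
    eval_steps: int = 0

    def __post_init__(self):
        # Reject values outside [0.0, 1.0): 1.0 would leave no training data.
        if not 0.0 <= self.eval_split < 1.0:
            raise ValueError(
                f"eval_split must be in [0.0, 1.0), got {self.eval_split}"
            )
        # Periodic eval was requested but there is nothing held out to evaluate.
        if self.eval_steps > 0 and self.eval_split == 0.0:
            raise ValueError(
                "eval_steps > 0 requires eval_split > 0.0; "
                "no offline eval would run"
            )
```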
* fix(train): prepare eval dataloader with accelerator for multi-GPU
Prepare eval_dataloader through accelerator.prepare() so eval data is
sharded across ranks instead of duplicated. Reduce eval_loss across
ranks with mean reduction for consistent logging.
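A library-free simulation of the idea: each rank evaluates a disjoint shard, and the per-rank losses are then averaged. The strided sharding and helper names are assumptions for illustration only; this does not call `accelerate` and may not match how it actually distributes batches.

```python
def shard_for_rank(indices, rank, world_size):
    """Give each rank a disjoint, strided slice of the eval indices."""
    return indices[rank::world_size]


def mean(values):
    return sum(values) / len(values)


def sharded_eval_loss(sample_losses, world_size):
    """Simulate per-rank eval followed by a mean reduction across ranks."""
    indices = list(range(len(sample_losses)))
    per_rank = []
    for rank in range(world_size):
        shard = shard_for_rank(indices, rank, world_size)
        per_rank.append(mean([sample_losses[i] for i in shard]))
    # Mean of per-rank means; equals the global mean when shards are equal size.
    return mean(per_rank)
```

With equally sized shards the reduced value matches the loss computed over the full eval set, while each rank processes only `1 / world_size` of the samples. With uneven shards the mean of means is only an approximation.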
* fix(test): rename eval_freq to env_eval_freq for multi-GPU training
Adds a steerable annotation pipeline (lerobot-annotate) that populates the language_persistent and language_events columns introduced in PR 1 (#3467), writing them directly into data/chunk-*/file-*.parquet.
This is PR 2 of the three-PR plan:
- PR 1 (Add extensive language support, #3467): schema + DSL + rendering; the base of this PR
- PR 2 (this PR): annotation pipeline writing into PR 1's columns
- PR 3: model with language prediction and runtime
A VLM (Qwen-VL family, served on vLLM) watches each episode's video and emits grounded language annotations: subtasks, plans, memory, task rephrasings, interjections + speech, and per-camera VQA. The pipeline is built for production annotation at scale: single-camera grounding, embedded-frame inputs, a describe-then-segment grounding flow, and a deterministic full-episode coverage guarantee. The design is informed by Scale's dense-captioning findings (representation > sampling, rules > reasoning, model capacity is the biggest lever, two-pass systems compound errors).
* feat(policies): Initial setup to push policies to hub with tags and model card
* feat: add dataset that is used to train
* Add model template summary
* fix: Update link model_card template
* fix: remove print
* fix: change import name
* fix: add model summary in template
* fix: minor text
* fix: comments Lucain
* fix: feedback steven
* fix: restructure push to hub
* fix: remove unneeded changes
* fix: import
* fix: import 2
* Add MANIFEST.in
* fix: feedback pr
* Fix tests
* tests: Add smolvla end-to-end test
* Fix: smolvla test
* fix test name
* fix policy tests
* Add push to hub false policy tests
* Do push to hub cleaner
* fix(ci): add push_to_hub false in tests
---------
Co-authored-by: Steven Palma <steven.palma@huggingface.co>
- Changes on the `test.yml` workflow:
  - Using poetry instead of pip. Contrary to what I wrote in #75, it is possible to use poetry (and have the benefits of shorter install times) without the need for having two separate versions of `pyproject.toml` and `poetry.lock`.
  - Reduce the trigger scope to only run when files in these directories are modified:
    - `lerobot/`
    - `tests/`
    - `examples/`
    - `.github/`
- Add `style.yml` workflow for doing a `ruff check` pass on the code
- More cleanup (removed deprecated workflow)