EMA was on by default, so every training run on the branch (incl. VLA-JEPA
and other non-flow-matching policies) created a full fp32 shadow copy. EMA
only benefits flow-matching/diffusion policies (pi0/pi05/pi052). Make it
opt-in via --ema.enable=true; the pi05/pi052 recipes already pass that flag.
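
A minimal sketch of the opt-in gate and of why the shadow copy costs a
full extra set of parameters, in plain Python rather than torch.
EmaConfig, maybe_make_ema and ema_update are hypothetical names used only
for illustration, and the decay value is an assumption:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmaConfig:
    # Hypothetical config mirroring --ema.enable; off by default (opt-in).
    enable: bool = False
    decay: float = 0.999  # assumed value, for illustration only


def maybe_make_ema(params: List[float], cfg: EmaConfig) -> Optional[List[float]]:
    """Return a shadow copy of the parameters only when EMA is enabled."""
    if not cfg.enable:
        return None  # no extra copy for policies that do not use EMA
    return list(params)  # full duplicate of every parameter


def ema_update(shadow: List[float], params: List[float], decay: float) -> None:
    """In-place update: shadow = decay * shadow + (1 - decay) * param."""
    for i, p in enumerate(params):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * p


params = [1.0, 2.0]
shadow = maybe_make_ema(params, EmaConfig(enable=True, decay=0.5))
params = [3.0, 4.0]  # pretend an optimizer step moved the weights
ema_update(shadow, params, decay=0.5)
print(shadow)  # [2.0, 3.0]
```

With the default EmaConfig() the helper returns None, so runs that do not
ask for EMA allocate nothing extra.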
Co-authored-by: Cursor <cursoragent@cursor.com>
LeRobot's RoboCasaEnv used a divergent flat state/action layout vs the
robocasa package (robocasa.utils.env_utils.convert_action) and the openpi
robocasa pipeline. This scrambled I/O when using openpi-convention
checkpoints (e.g. the JAX->PyTorch->LeRobot converted pi05 robocasa model:
CloseFridge improved from 20% to 60% once both orders matched openpi).
- convert_action: ee_pos(3)+ee_rot(3)+gripper(1)+base_motion(4)+control_mode(1)
- observation.state: ee_pos_rel(3)+ee_rot_rel(4)+base_pos(3)+base_rot(4)+gripper(2)
Matches openpi examples/robocasa/main.py + RobocasaInputs ordering.
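
The two layouts above can be sketched as named slices over a flat vector.
This is an illustration only: the field names and widths are the ones
listed in this message, while split_flat is a hypothetical helper, not
part of RoboCasaEnv or the robocasa package:

```python
# Field order and widths as listed in this commit message.
ACTION_LAYOUT = [
    ("ee_pos", 3), ("ee_rot", 3), ("gripper", 1),
    ("base_motion", 4), ("control_mode", 1),
]
STATE_LAYOUT = [
    ("ee_pos_rel", 3), ("ee_rot_rel", 4), ("base_pos", 3),
    ("base_rot", 4), ("gripper", 2),
]


def split_flat(vec, layout):
    """Split a flat vector into named fields, in layout order."""
    total = sum(width for _, width in layout)
    if len(vec) != total:
        raise ValueError(f"expected {total} values, got {len(vec)}")
    out, start = {}, 0
    for name, width in layout:
        out[name] = list(vec[start:start + width])
        start += width
    return out


action = split_flat(list(range(12)), ACTION_LAYOUT)
state = split_flat(list(range(16)), STATE_LAYOUT)
print(action["base_motion"])  # [7, 8, 9, 10]
print(state["gripper"])       # [14, 15]
```

A checkpoint trained on one ordering and fed the other reads, for example,
gripper values where it expects base motion, which is the scrambling
described above.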
Co-authored-by: Cursor <cursoragent@cursor.com>
Eliminate the standalone pi052/pi05_backbone.py by distributing its contents:
- Generic dual-expert transformer machinery -> lerobot/policies/pi_gemma.py
(sdpa_attention_forward, compute_layer_complete, PaliGemmaWithExpertModel,
get_gemma_config; the openpi width/depth config is renamed GemmaConfig ->
GemmaVariantConfig to avoid clashing with transformers' GemmaConfig). These
sit next to the existing PiGemma layer code they already depend on.
- pi052-specific model + helpers -> pi052/modeling_pi052.py (PI05Pytorch,
ActionSelectKwargs, make_att_2d_masks, pad_vector, resize_with_pad_torch,
create_sinusoidal_pos_embedding, sample_beta, get_safe_dtype).
DEFAULT_IMAGE_SIZE is duplicated as a plain constant in pi_gemma to avoid a
pi_gemma -> pi05 import cycle. Additive to pi_gemma; pi0/pi05 unaffected.
Verified bit-exact on pepijn223/pi052_robocasa_full (embed/predict/forward
identical) and all 34 pi052 tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
_fast_ce/_shifted_ce were renamed to _fast_lin_ce/_shifted_lin_ce and changed
from logits-based to Liger fused-linear-CE (hidden @ lm_head_weightᵀ). Update
the tests via thin adapters that pass an identity lm_head_weight (so the
computed logits equal the provided ones), run on CUDA (Liger is GPU-only) and
skip otherwise, and loosen the allclose tolerance to absorb GPU-vs-CPU float
noise on the tiny losses.
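
Why the identity-weight adapter works, as a small pure-Python sketch
(the real tests use torch tensors and the Liger kernel on CUDA; the helper
names here are hypothetical). With lm_head_weight = I, hidden @ Wᵀ returns
hidden unchanged, so the old logits can be passed in as "hidden":

```python
def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def linear_logits(hidden, lm_head_weight):
    """Compute hidden @ lm_head_weight^T for row-major nested lists."""
    return [
        [sum(h * w for h, w in zip(row, w_row)) for w_row in lm_head_weight]
        for row in hidden
    ]


# Logits the old logits-based tests would have passed directly.
logits = [[0.5, -1.0, 2.0], [3.0, 0.0, -0.25]]

# Feed them as "hidden states" with an identity head: output equals input.
recovered = linear_logits(logits, identity(3))
print(recovered == logits)  # True
```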
Co-authored-by: Cursor <cursoragent@cursor.com>
The smolvla branch had modified the shared pi0/pi05 modeling + pi05 config to
support pi052 (SDPA attention, layernorm/lm_head handling, optimizer
foreach/fused/lm_head_lr_scale, embedding scaling). Decouple pi052 instead:
- Vendor the PI0.5 backbone (PaliGemmaWithExpertModel, PI05Pytorch, helpers)
into pi052/pi05_backbone.py (verbatim copy, no PI05Policy).
- Flatten PI052Policy to subclass PreTrainedPolicy directly (no longer
PI05Policy); inline the needed PI05Policy methods.
- Restore optimizer_foreach/fused + get_optimizer_preset on PI052Config.
- Revert pi0, pi0_fast, pi05 modeling and configuration_pi05 to origin/main
(byte-identical), so the shared policies carry no smolvla modifications.
Behavior verified bit-exact on pepijn223/pi052_robocasa_full:
embed_language_tokens, predict_action_chunk, and the fused flow+text+FAST
training loss are
identical before/after (max_abs_diff=0). pi052 tests pass (pre-existing
stale-name collection errors unchanged).
Co-authored-by: Cursor <cursoragent@cursor.com>
* first commit
* feat(policies): add VLA-JEPA
* feat(policies): add VLA-JEPA
* support vla_jepa
* (feat)policies: add VLA-JEPA
* linting
* adding deps to pyproject.toml
* updating uv lock
* adding guards to avoid needing transformers and diffusers for type checking and basic tests
* fixing action and state dim
* fix warnings with qwen processor kwargs
* fixing wm_loss not propagating
* adjusting obs steps, tubelet size to match original implementation
* some more fixes to be closer to the original implementation
* adding more tests to ensure good coverage
* align VLA-JEPA architecture with original checkpoint
- Remove stale `action_num_heads` / `action_attention_head_dim` config fields;
DiT head dimensions are now always derived from the preset (DiT-B/L/test).
- Add `num_target_vision_tokens` and `action_max_seq_len` config fields required
by the action head's future-token embedding and positional embedding tables.
- Fix default `qwen_model_name` to 2B (matches all released checkpoints).
- Rename `ActionEncoder` attrs w1/w2/w3 → layer1/layer2/layer3 to match
checkpoint key names; replace `nn.Sequential` decoder/state-encoder with
`_MLP2` (layer1/layer2 naming).
- Fix `VLAJEPAActionHead` to size ActionEncoder and StateEncoder at `inner_dim`
(DiT input width) rather than `action_hidden_size` (DiT output width).
- Rename `DiT.blocks` → `transformer_blocks` and `attn` → `attn1` to match
checkpoint; add alternating cross/self attention (even blocks cross-attend to
Qwen context, odd blocks self-attend).
- Add `DiT-test` preset for unit tests.
- Rewrite `ActionConditionedVideoPredictor` with explicit ViT-style blocks
(`_PredictorBlock` with fused qkv) to match checkpoint structure; rename
`encoder`/`norm`/`proj` → `predictor_blocks`/`predictor_norm`/`predictor_proj`.
* propagate action_is_pad masking through VLA-JEPA policy pipeline
Pass the `action_is_pad` tensor from the batch through to the action head
so padded timesteps are excluded from the flow-matching loss.
* update VLA-JEPA tests for arch changes and action_is_pad
- Switch conftest to use `action_model_type="DiT-test"` now that
`action_num_heads` / `action_attention_head_dim` have been removed.
- Add action_head tests covering fully-padded loss (zero) and equivalence
of action_is_pad=None vs all-zeros mask.
- Remove obsolete `test_native_to_lerobot_wm_only` test.
* add VLA-JEPA documentation
Covers architecture overview, pretrained checkpoints, config reference,
training/eval commands for LIBERO-10, and guidance on fine-tuning for
single-camera datasets.
* add one-shot script to convert ginwind/VLA-JEPA checkpoints to safetensors (will remove once migrated)
* make default params more aligned with paper and pretrained models
- adding possibility of freezing qwen backbone and world model
- added tests for weight loading
* trying out to re-init the action head to avoid pretraining dimension mismatch
* allow different state dim and action dim
* removing misleading future_action_window_size to just use chunk_size
* lots of changes to make existing weights work, need to massively refactor the pre and post processing
* refactoring into using pre and post processor
* pre-commit cleanup
* fixing doc defaults args
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
* addressing dtype zeros issue
* adding guard for diffusers
* fixing training and eval examples
* trying to close success rate gap
* fix qwen norm layer output; libero eval is now as expected
* adding instructions for different embodiment + fixing some tests
* smol fix to avoid having default CPU device when training
* fixing misconception about multiview / singleview handling
* removing conversion script
* adding licences
* adding .mdx docs and shortening policy_vla_jepa_README.md
* removing useless pre-processor
* cleanup
* removing swish in favor of silu
* adding configuration gripper index and threshold
* fixing symlink
---------
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: ginwind <ginwind@mail.ustc.edu.cn>
embed_language_tokens already applies Gemma's sqrt(hidden) normalizer
(GemmaTextScaledWordEmbedding, transformers >=5.4.0). pi052 multiplied FAST
action-token and autoregressive subtask-text embeddings by sqrt(emb_dim) on
top of that, double-scaling them (~2048x). Remove the manual scaling so FAST
and text tokens are single-scaled, consistent with the pi05 fix and OpenPI.
Co-authored-by: Cursor <cursoragent@cursor.com>
transformers >=5.4.0 (PR #44432) makes Gemma's embed_tokens a
GemmaTextScaledWordEmbedding that already multiplies token embeddings by
sqrt(hidden_size). The manual `* sqrt(embed_dim)` applied on top therefore
double-scaled text (~2048x instead of ~45x), breaking VLM alignment for
models trained/run on stock transformers. Remove the manual scaling and rely
on embed_tokens' internal normalizer (matches main #3603). Image features
stay raw (un-normalized), as before.
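
The ~45x and ~2048x figures follow from the arithmetic below. The
hidden_size of 2048 is an assumption chosen because it reproduces both
numbers; it is not read from a config here:

```python
import math

hidden_size = 2048  # assumed width; matches the ~45x / ~2048x figures above

single = math.sqrt(hidden_size)  # applied once, inside embed_tokens
double = single * single         # manual `* sqrt(embed_dim)` applied on top

print(round(single, 2))  # 45.25
print(round(double))     # 2048
```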
Co-authored-by: Cursor <cursoragent@cursor.com>
OpenPI (pi0 and pi0-FAST) multiplies language token embeddings by
sqrt(embed_dim) — the Gemma embedder normalizer — before the transformer.
LeRobot pi0/pi0_fast omitted it, leaving text tokens ~45x under-scaled
relative to the residual stream (same class of bug as the pi05 image
scaling). pi0: applied in embed_prefix's lang_embed_func. pi0_fast:
applied inside embed_language_tokens so prompt, FAST action tokens, and
autoregressive next-token embeds are all scaled consistently.
Co-authored-by: Cursor <cursoragent@cursor.com>
lerobot/pi05_base was trained in the OpenPI/big_vision regime where image
(soft) tokens are NOT multiplied by the Gemma embedder normalizer
(sqrt(hidden_size)) — only text tokens are. Scaling image features here
over-scaled them ~45x, breaking the pretrained vision-language alignment
and yielding ~0% closed-loop success on RoboCasa across all pi05 runs.
Co-authored-by: Cursor <cursoragent@cursor.com>
Bring the authoritative annotation pipeline from the annotation branch.
The annotation surface is forced to EXACTLY match
feat/language-annotation-pipeline (the annotation branch is the source of
truth for annotation
code), which also removes smolvla's stale copies:
- deleted: steerable_pipeline/vocabulary.py,
tests/annotations/test_vocabulary.py, prompts/module_0_vocabulary.txt,
module_1_action_record.txt, module_3_vqa.txt, module_1_plan.txt, and the
old module_* prompt names (now plan_*/interjections_*/vqa.txt).
- synced: all of src/lerobot/annotations/, lerobot_annotate.py,
examples/annotations/, tests/annotations/, datasets/language.py,
tests/datasets/test_language.py, docs/annotation_pipeline.mdx.
Non-annotation conflicts resolved by union (keeping both branches' intent):
- pyproject.toml: keep smolvla's pi extra (+sentencepiece) and add the
molmoact2 extra from main.
- policies/factory.py: keep both dataset_repo_id (pi052 FAST tokenizer)
and dataset_meta (both are referenced); union the policy-type docstring.
- scripts/lerobot_train.py: keep smolvla's pi052 / use_relative_actions
processor-rebuild block.
- uv.lock: regenerated from the merged pyproject.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first-token parity check re-tokenized the decoded (stripped) inference
string, so the leading-space SentencePiece variant always mismatched the
training argmax — a false "DIVERGED" alarm. Remove the autoregressive
inference print and parity comparison (and the now-dead per-sample
select_message generation), keeping only the prompt, ground-truth target,
and teacher-forced argmax accuracy.
Co-authored-by: Cursor <cursoragent@cursor.com>
Collapse the remaining multi-line field comments / docstrings in config.py
to single lines (or two where a knob genuinely needs it), keeping the
essential rationale. Comments only — no field or behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the optional structured per-subtask action records — not a feature
we want to ship.
* language.py: remove 'action_record' from CORE_STYLES + PERSISTENT_STYLES
(and the matching assertion in tests/datasets/test_language.py).
* config.py: delete ActionRecordsConfig (verb/grasp vocabularies,
frames_per_subtask, emit_record_row) and the PlanConfig.action_records
field.
* plan_subtasks_memory.py: delete _extract_action_record and the
run_episode block that emitted style='action_record' rows; drop the
now-unused json / to_image_blocks imports.
* remove the plan_action_record.txt prompt.
* run_hf_job.py: drop the action_records comment.
Verified: 40 tests pass; pre-commit (ruff, mypy, bandit) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the remaining multi-line comment blocks in config.py (derive_task,
frames/window, describe_first, action-record/vqa/vlm fields, video_backend,
repo ids, executor) to 1-3 lines each. Also fix a stale path typo
('examples/annotation' -> the docstring now just says HF Jobs). Comments
only — no field or behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two behavior-preserving simplifications:
* plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form
branches built identical deduped rows via copy-pasted seen/append
loops. Collapse to one branch that picks the variant source, then a
shared _task_aug_rows() helper does the dedup + row build (-~25 LOC).
* writer: _normalize_persistent_row / _normalize_event_row shared the
same camera-validate + struct construction. Extract _normalize_row(),
keeping the exact key order (the parquet struct schema is inferred
from insertion order, so timestamp must stay between style and camera).
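
A rough sketch of the two helpers' shape, under stated assumptions: the
row fields, the "task_aug" style string and both function bodies are
hypothetical; only the dedup-then-build idea and the relative order
style -> timestamp -> camera come from this message:

```python
def task_aug_rows(variants, style="task_aug"):
    """Order-preserving dedup, then row build (hypothetical row shape)."""
    seen, rows = set(), []
    for text in variants:
        text = text.strip()
        if not text or text in seen:
            continue  # skip blanks and repeats, keep first occurrence
        seen.add(text)
        rows.append({"style": style, "content": text})
    return rows


def normalize_row(style, timestamp, camera):
    """Build the struct with a fixed key order: style, timestamp, camera."""
    # dicts preserve insertion order, so the order written here is the
    # order any schema inference downstream will see.
    return {"style": style, "timestamp": float(timestamp), "camera": camera}


rows = task_aug_rows(["open the fridge", "open the fridge ", "close the fridge"])
print([r["content"] for r in rows])  # ['open the fridge', 'close the fridge']
print(list(normalize_row("vqa", 1, "front")))  # ['style', 'timestamp', 'camera']
```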
docs: 'Which modules run' is now a table giving each module's on/off flag
(--plan.enabled / --interjections.enabled / --vqa.enabled) and what it
turns off.
Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyproject annotations-extra comment still described the removed
vllm/transformers in-process backends ('vllm preferred ... transformers
fallback', '_make_vllm_client'); rewrite it for the openai-only reality
and trim it. Also condense the conftest lazy-import NOTE. Comments only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dead code (defined but never referenced anywhere in src/tests/examples):
* reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path,
and the now-orphaned gather_data_paths + episode_offsets_per_path
(lookup_data_path was their only caller).
* staging.py: iter_staged_episodes.
* writer.py: normalize_rows_for_writer.
* config.py VlmConfig: json_mode, batch_size, tensor_parallel_size,
gpu_memory_utilization, trust_remote_code — consumed only by the
in-process vllm/transformers backends that were removed; the openai
auto-serve path carries those vLLM flags via serve_command instead.
Kept max_model_len (still used as the serve-command default).
* config.py TaskAugAxesConfig.total property.
Docs: new 'Key options' section in annotation_pipeline.mdx — grouped
tables (dataset in/out, module toggles, --vlm.*, --plan.*, interjections
+ vqa) describing the flags users actually reach for, with defaults.
config.py: compact the verbose field comments + ActionRecordsConfig /
TaskAugAxesConfig docstrings; fix two stale 'verify' references (the
verify pass was removed — it's describe -> segment now) and the stale
'renders record back to subtask text' note (that path was removed).
vlm_client docstring no longer mentions the removed json_mode field.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim the long inline comment blocks (effective_task / task_aug, action
records, plan-boundary rows, plan-update span closing, windowed +
coverage-stitch sections) and the _generate_plan / run_plan_updates
docstrings to a few lines each. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same flags and rationale, condensed — each plan-module flag now has a
short one/two-line comment instead of a paragraph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-down flow (read episodes → 3 modules fan out → validator → writer →
parquet) with aligned boxes, instead of the cramped bordered version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite annotation_pipeline.mdx in plainer, easier-to-read language
(shorter sentences, active voice, a plain-text intro), add an ASCII
'How it fits together' architecture diagram, and remove the
'Reproducibility via seed and prompt hashes' section. Content/links are
preserved; only wording and structure change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example already pins '@main'; update the doc step and the script
docstring from 'the branch under test' to 'lerobot (from main)' now that
the pipeline is merging to main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bugs
* validator: don't re-raise on unknown style. The second column_for_style
lookup (used to route persistent vs event) now sits in try/except so an
unknown style is recorded by _check_column_routing and skipped instead
of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set but
the configured camera_key isn't one the provider exposes, warn and fall
back to all cameras instead of returning a phantom key that KeyErrors
deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
rather than a hardcoded 0.0 (datasets can start at non-zero t).
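
The camera fallback and the timestamp clamp can be sketched as follows.
Both functions are simplified stand-ins with hypothetical signatures, not
the actual general_vqa / interjections code:

```python
import warnings


def target_cameras(available, camera_key, restrict_to_default_camera):
    """Pick cameras to query; fall back to all when the key is unknown."""
    if not restrict_to_default_camera:
        return list(available)
    if camera_key in available:
        return [camera_key]
    warnings.warn(f"camera {camera_key!r} not available; using all cameras")
    return list(available)


def clamp_to_episode_start(t, frame_timestamps):
    """Clamp to the first frame's timestamp rather than a hardcoded 0.0."""
    return max(t, frame_timestamps[0])


frame_timestamps = [12.0, 12.5, 13.0]  # episode that starts at non-zero t
print(clamp_to_episode_start(11.2, frame_timestamps))  # 12.0
print(clamp_to_episode_start(12.7, frame_timestamps))  # 12.7
print(target_cameras(["front", "wrist"], "front", True))  # ['front']
```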
Docs / code drift
* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
exists); describe the real describe->segment + coverage-stitch flow.
Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
this PR' (matches tools.mdx, which already marks the runtime layer as
not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
the default is now a single h200. Add a 'Contributing new modules'
section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.
run_hf_job.py
* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
(scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
after merge).
Naming / cleanup
* rename dest_repo_id -> new_repo_id across config / script / example /
test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
'# nosec B607' and drop it from the global skip list.
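
For the init=False change, a minimal sketch of the pattern. FrameProvider
and note_decode_failure are hypothetical; only the _warned_decode_fail
field name comes from this message:

```python
from dataclasses import dataclass, field, fields


@dataclass
class FrameProvider:
    root: str
    # Declared as a real field, excluded from __init__ and repr, instead of
    # being attached later via getattr/setattr.
    _warned_decode_fail: bool = field(default=False, init=False, repr=False)

    def note_decode_failure(self) -> bool:
        """Return True only the first time, so the warning is emitted once."""
        if self._warned_decode_fail:
            return False
        self._warned_decode_fail = True
        return True


p = FrameProvider(root="/tmp/ds")
print(p.note_decode_failure())  # True
print(p.note_decode_failure())  # False
print([f.name for f in fields(p) if f.init])  # ['root']
```

Because the flag is a declared field, type checkers see it and callers
cannot pass it to the constructor by accident.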
Tests
* fix stale canned-VLM markers ('ONE realistic interruption' ->
'compact interjection', 'Update the memory' -> 'compressed semantic
memory') and drop the dead 'concise hierarchical PLAN' plan responders
(plan generation is deterministic now) in run_e2e_smoke,
test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES,
so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and
the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in
test_style_registry_routes_columns failed. Add it to CORE_STYLES so the
registry, the persistent-set, and column_for_style() stay consistent.
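
The failing invariant, reduced to sets. The set members below are
illustrative placeholders (the real sets live in datasets/language.py);
only the set names and the subset relation come from this message:

```python
# Illustrative members only.
CORE_STYLES = {"subtask", "plan", "memory"}
EXTENDED_STYLES = {"vqa"}
PERSISTENT_STYLES = {"subtask", "plan", "memory", "action_record"}
EVENT_ONLY_STYLES = {"vqa"}


def registry_is_consistent(core):
    """Check PERSISTENT | EVENT_ONLY <= STYLE_REGISTRY for a given core set."""
    style_registry = core | EXTENDED_STYLES
    return (PERSISTENT_STYLES | EVENT_ONLY_STYLES) <= style_registry


print(registry_is_consistent(CORE_STYLES))                      # False
print(registry_is_consistent(CORE_STYLES | {"action_record"}))  # True
```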
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shipped workflow is Hugging Face Jobs
(examples/annotations/run_hf_job.py): it serves the model with vLLM in the
vllm/vllm-openai image and
the pipeline talks to it over the OpenAI-compatible API. The in-process
vllm / transformers local backends added surface (and the vllm
one pinned an old torch) without being part of that path, so they're
removed for now.
* vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub'
rejected with the usual guidance). Requesting 'vllm'/'transformers'
now raises a clear 'not supported for now — use the HF Jobs flow'
error. Removed _make_vllm_client and _make_transformers_client.
* config: backend docstring updated (openai-only); default model_id
bumped to Qwen/Qwen3.6-27B to match run_hf_job.
* docs/annotation_pipeline.mdx: remove the '## Running locally'
section; the launcher description now says one vLLM server per GPU
over the OpenAI API, and the 'One Qwen-VL pass' note drops the
'vLLM/transformers fallback' wording.
Tests are unaffected (they construct StubVlmClient directly; nothing
referenced the removed backends).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The annotation tests had never actually run in CI (collection failed on
the missing 'datasets' extra); now that they do, three stale assertions
surfaced against the evolved pipeline:
* test_module1_plan_memory_subtask_smoke: the memory canned-responder
marker 'Update the memory' no longer appears in module_1_memory.txt
(now 'compressed semantic memory'), so the stub returned no memory
row and the {subtask,plan,memory} subset check failed. Marker
updated to match the current prompt.
* test_module2_mid_episode_emits_paired_interjection_and_speech: the
interjection marker 'Write ONE interjection' is now 'Write ONE
compact interjection' in module_2_interjection.txt, so 0 interjections
were emitted. Marker updated.
* tests/datasets/test_language.py::test_style_registry_routes_columns:
PERSISTENT_STYLES gained 'action_record' in this PR; add it to the
expected set.
These are test/prompt-marker syncs — no production behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest 'dataset' tier failed collecting
tests/datasets/test_video_decoder_cache.py with 'Could not load
libtorchcodec ... undefined symbol:
torch_dtype_float4_e2m1fn_x2' — a torch/torchcodec ABI mismatch.
Root cause: the annotations extra's vllm hard-pins an older torch
(via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock
across all extras, so vllm capped torch to 2.8 for every tier —
including dataset, whose torchcodec 0.11.1 needs torch 2.11. The
result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break.
(main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.)
Fix: remove vllm from the annotations extra. It is not needed by
the shipped workflow — examples/annotations/run_hf_job.py gets vllm from
the vllm/vllm-openai image and talks to it over the OpenAI-compatible
API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm
lazily. For the in-process --vlm.backend=vllm path, install vllm
separately (the ImportError now says so).
After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching
main); uv lock --check is clean. The annotations extra still provides
datasets / transformers / openai.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier
with 'ModuleNotFoundError: No module named datasets': tests/annotations/
conftest.py imported the fixture dataset builder (-> lerobot.datasets ->
the HF 'datasets' lib + pandas/pyarrow), which only ship under the
'dataset' extra, so the whole annotations package crashed.
Fix uses the repo's proven module-level guard pattern (see
tests/datasets/test_language.py), NOT a conftest-level importorskip —
verified empirically that pytest.importorskip raised during conftest
*import* is treated as a collection ERROR (exit 1), while module-level
importorskip is a clean SKIP.
* conftest.py: import build_annotation_dataset LAZILY inside the
fixtures so the conftest itself imports cleanly in every tier.
* test_modules / test_validator / test_writer /
test_pipeline_recipe_render: add module-level
pytest.importorskip('datasets') +
('pandas') before the pyarrow / lerobot.* imports (# noqa: E402 to
match the existing convention). pyarrow-importing modules place the
guard before the pyarrow import.
* tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub
path imports lerobot.datasets).
Result:
- base / hardware / viz tiers (no dataset extra): annotation tests
skip cleanly; the rest of the suite runs -> exit 0.
- dataset tier: datasets present -> guards pass through -> annotation
tests run with the stub VLM. The pipeline modules import only
stdlib + relative + lerobot.datasets (no module-level datatrove /
vllm / openai), so they import fine there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docs pointed at src/lerobot/datasets/v30/, which does not exist.
Both scripts actually live in src/lerobot/scripts/:
- convert_dataset_v21_to_v30.py
- augment_dataset_quantile_stats.py
Updated the four references (one python -m module path and three
file-path invocations) to the correct location, matching each
script's own usage docstring.
* fix(train): enable relative action overrides for pretrained processors
Keep pretrained processor pipelines when use_relative_actions is enabled and
apply relative/absolute action processor settings through overrides. Rename the
relative action processor registry key to relative_actions_processor.
* fix(config): reject rename_map without pretrained checkpoint
Fail fast when rename_map is set during fresh initialization, since fresh
configs derive feature names from the current dataset and no rename is applied.
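
The fail-fast check might look roughly like this. TrainConfigSketch and
its pretrained_path field are hypothetical stand-ins for the real config;
only the rule itself (rename_map needs a pretrained checkpoint) comes from
this message:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrainConfigSketch:
    pretrained_path: Optional[str] = None  # hypothetical field name
    rename_map: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject a rename_map when no pretrained checkpoint is given."""
        if self.rename_map and self.pretrained_path is None:
            raise ValueError(
                "rename_map is set but no pretrained checkpoint was given; "
                "fresh configs take feature names from the dataset"
            )


TrainConfigSketch(pretrained_path="some/ckpt", rename_map={"a": "b"}).validate()

try:
    TrainConfigSketch(rename_map={"a": "b"}).validate()
except ValueError as err:
    print("rejected:", err)
```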
---------
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
``fit_fast_tokenizer`` previously called ``LeRobotDataset(repo_id,
episodes=[N])`` per sampled episode, which on v3-format datasets
routes through HF datasets' split lookup and raises ``ValueError:
Instruction "train" corresponds to no data!`` on every episode. On
``pepijn223/robocasa_pretrain_human300_v4`` (32 k episodes) this looped
through 13,293 skipped episodes for ~2.5 h before the NCCL watchdog
killed the run via the 2 h ALLREDUCE timeout (job 22182985).
Switch to reading the ``action`` column directly from the dataset's
``data/chunk-*/file-*.parquet`` shards (same pattern as the audit
scripts). Verified end-to-end on the 32 k-episode dataset: 1000 chunks
collected from 1000 episodes in 70.7 s.
Co-authored-by: Cursor <cursoragent@cursor.com>
Quality-gate fix: ruff-format/markdown prettier hook reflow of the
annotation pipeline doc. No content change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quality-gate fixes after the main merge:
* UP037: drop redundant quotes from PlanConfig forward-ref annotations
(action_records / task_aug_axes) — safe under 'from __future__ import
annotations'.
* ruff format applied to config.py, executor.py, general_vqa.py,
plan_subtasks_memory.py, validator.py, lerobot_annotate.py.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long episodes no longer get sparse subtasks. Previously a long episode
was subsampled to max_video_frames=32 across its whole duration (~1
frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT
frames_per_second density by splitting the episode into fixed-length
windows and running the subtask chain per window.
New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and
the episode is longer than one window:
* episode is split into consecutive [w0, w1] windows of this length
* each window's frames are sampled at frames_per_second (so a 32s
window at 1 fps = 32 frames, filling but not exceeding the per-call
context budget)
* the full describe -> segment -> verify chain runs PER window, in
window-relative time [0, L]; spans are offset back to absolute
* all windows' spans are merged, frame-snap-deduped, and stitched into
one contiguous whole-episode cover
Implementation:
* _episode_video_block / _video_message / _describe_episode /
_verify_subtasks gain an optional window=(w0,w1); when set they
embed frames sampled in that absolute range at frames_per_second
(video_url path skipped — it's whole-episode).
* _clean_spans gains bounds= (override clamp range, for window-relative
spans) and dedupe= (skip frame-snap until the merged absolute set).
* new _generate_subtasks_windowed + _subtasks_for_window orchestrate
the loop; _generate_subtasks branches to them when window_s > 0.
run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps).
Cost scales with episode length (chain calls × ceil(duration/window)).
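
The window arithmetic can be sketched in a few lines. The function names
below are hypothetical and this is not the code in
_generate_subtasks_windowed; it only illustrates the split, the constant
sampling density and the offset back to absolute time:

```python
import math


def split_windows(duration_s, window_s):
    """Return consecutive (w0, w1) windows; window_s <= 0 disables windowing."""
    if window_s <= 0 or duration_s <= window_s:
        return [(0.0, duration_s)]
    n = math.ceil(duration_s / window_s)
    return [(i * window_s, min((i + 1) * window_s, duration_s)) for i in range(n)]


def frames_in_window(w0, w1, frames_per_second):
    """Number of frames sampled in a window at a constant density."""
    return math.ceil((w1 - w0) * frames_per_second)


def to_absolute(span, w0):
    """Offset a window-relative (start, end) span back to episode time."""
    start, end = span
    return (start + w0, end + w0)


windows = split_windows(duration_s=120.0, window_s=32.0)
print(windows)
# [(0.0, 32.0), (32.0, 64.0), (64.0, 96.0), (96.0, 120.0)]
print([frames_in_window(w0, w1, 1.0) for w0, w1 in windows])  # [32, 32, 32, 24]
print(to_absolute((4.0, 10.0), w0=64.0))  # (68.0, 74.0)
```

For the 2-minute example above this gives ceil(120 / 32) = 4 windows, so
the chain runs four times instead of once.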
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active)
to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning
study, model capacity is the #1 lever and the dominant failure is
visual grounding — both helped by ~9x more active params. Qwen3.6-27B
is a vision-language model (vision encoder, image + video), same family
so the chat template / video handling / enable_thinking=false flag are
unchanged, and at 27B dense it still fits one H200 per server, so the
two-parallel-server layout (TP=1, one per GPU) is preserved — no
throughput-layout change, just a much stronger model.
Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame
embedded budget is ~10k tokens, well under), gpu-mem 0.8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs
and hurt), so each rule here is phrased visually and scoped to one
confusable pair:
* stack-vs-put (on top of an object vs on a surface)
* insert-vs-put (fitted slot vs surface)
* pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
gripper closes + object moves with hand = pick up; gripper opens +
object stays = put — directly targets Scale's dominant
direction-flip failure)
* pour-vs-put (tilt + flow vs untilted move)
This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already aligns with their 'avoid' list (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection, uniform sampling, describe-don't-predict framing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switching the plan module to embedded frames (use_video_url=false)
exposed a context overflow: at frames_per_second=2.0 with the old
max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈
33-39k vision tokens, over the model's 32768 context — every plan call
died with 'Input length exceeds maximum context length' (HTTP 400),
crashing the whole annotation job.
The video_url path never hit this because the server downsampled; the
embedded path sends every sampled frame, so the frame count is a hard
token budget.
Fix:
* config default max_video_frames 128 -> 32 (~8-10k vision tokens,
comfortable headroom for the prompt + describe/verify passes).
Frames are still sampled UNIFORMLY across the whole episode, so
longer episodes are subsampled, not truncated — full temporal
coverage preserved, just coarser density.
* run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit
--plan.max_video_frames=32, with a comment explaining the token
budget and the 'do not raise toward 128 with embedded frames' rule.
Only the plan module embeds the full episode; VQA (1 frame/tick) and
interjections (4-frame window) were never at risk.
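
A back-of-the-envelope check of the budget, assuming the per-frame cost
implied by "~128 frames ≈ 33-39k vision tokens" (roughly 258 to 305 tokens
per 480x640 frame). The helper names are hypothetical and the numbers are
estimates derived from this message, not measurements:

```python
CONTEXT_LEN = 32768

# Per-frame token range implied by the 128-frame figure above.
TOKENS_PER_FRAME = (33_000 / 128, 39_000 / 128)


def vision_token_range(num_frames):
    """Estimated (low, high) vision tokens for a number of embedded frames."""
    lo, hi = TOKENS_PER_FRAME
    return (round(lo * num_frames), round(hi * num_frames))


def sampled_frames(duration_s, frames_per_second, max_video_frames):
    """Uniform sampling: the cap bounds the count, it does not truncate."""
    return min(max_video_frames, int(duration_s * frames_per_second))


for cap in (128, 32):
    lo, hi = vision_token_range(cap)
    print(cap, lo, hi, hi < CONTEXT_LEN)
# 128 33000 39000 False
# 32 8250 9750 True

print(sampled_frames(120.0, 1.0, 32))  # 32
```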
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>