* fix(record): pass rename_map to make_policy in lerobot-record
Fixes#3181. The rename_map from dataset config was used for preprocessor
construction but not passed to make_policy(), causing feature mismatch
errors when camera key names differ between dataset and model config.
make_policy() already accepts a rename_map parameter and uses it to skip
visual feature consistency validation when remapping is active, but
lerobot_record.py was not passing it through.
* style: fix ruff format for ternary expression
---------
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Replace the `with profiler or nullcontext():` wrap around the entire
training loop with explicit `profiler.start()` / `profiler.finalize()`
calls, and tighten `_section(...)` regions in `update_policy` to only
wrap the hot calls (forward / backward / optimizer.step).
This avoids ~120 lines of pure re-indentation noise while keeping the
exact same artifacts on disk and the same public behavior.
lerobot_train.py diff vs main: 267 -> 29 changed lines.
Made-with: Cursor
groot's only policy-specific dependency is flash-attn, which has no
prebuilt wheel for torch 2.10 and requires nvcc to build from source.
The CI image is based on nvidia/cuda:12.4.1-base, which ships the
CUDA runtime but not the compiler toolkit, so the source build fails
with `/usr/local/cuda/bin/nvcc: No such file or directory`. The
repo's own pyproject.toml already carries a TODO acknowledging this:
gr00t needs bespoke flash-attn install steps.
Treat this as an environmental limitation rather than a regression:
dep-install failures for groot are logged via `::warning::` and skip
the policy without failing the job. Dep-install failures for any
other policy remain fatal, so real regressions still surface.
Made-with: Cursor
The dashboard expects per-phase timings (forward_s, backward_s,
optimizer_s) in step_timing_summary.json, but only total_update_s
and dataloading_s were collected — leaving every chart except
dataloading empty.
Add a lightweight TrainingProfiler.section(name) context manager
that times a region with torch.cuda.synchronize before and after
(so GPU work is captured, not just the kernel-launch latency) and
accumulates per-section samples into step_timing_summary.json.
Wrap forward, backward (incl. grad clip), and optimizer (incl.
zero_grad and scheduler.step) in update_policy with these sections.
When profiling is off (profiler=None) the wrappers become no-ops,
so training performance is unchanged outside CI.
Made-with: Cursor
groot depends on flash-attn, which fails to build in uv's default
isolated build env because it doesn't declare torch as a build-time
dependency. Torch is a core lerobot dep and is already present in
the target venv when groot is synced, so we can safely disable
build isolation just for flash-attn. The flag is a no-op for
policies that don't pull in flash-attn.
Made-with: Cursor
- wall_x: switch to SGD optimizer + explicit scheduler overrides.
The 4B-param model casts to bf16 internally, but AdamW's exp_avg/
exp_avg_sq states blow past the 22 GB GPU. Same fix we applied to
pi0/pi05/pi0_fast.
- xvla: fix rename_map. Dataset (libero_plus) exposes front/wrist
image keys; the model expects image/image2. Previous map was
direction-reversed and left the batch without any recognized
image feature.
Made-with: Cursor
Add groot, xvla, diffusion and wall_x (wall-oss-flow) to the smoke
profiling filter and switch the runner to per-policy dependency
resolution. Each policy now gets its own `uv sync --extra <policy>`
pass followed by a profiling run, so heavy or conflicting extras
(flash-attn, peft, diffusers, etc.) can never block another policy's
profiling. A failure in one policy is logged and surfaces a non-zero
exit at the end instead of aborting the matrix.
Made-with: Cursor
Adam optimizer states (exp_avg + exp_avg_sq) require ~16GB extra on top of
model params and gradients for 4B parameter models, exceeding the 22GB GPU.
SGD has zero optimizer state overhead and profiling only measures
forward/backward timing anyway.
Also adds torch.cuda.empty_cache() after deterministic forward to release
transient memory before the training loop starts.
Made-with: Cursor
Move all profiling orchestration out of lerobot_train.py and
TrainPipelineConfig into a TrainingProfiler class in profiling_utils.py.
- lerobot_train.py: ~74 lines of profiling code reduced to ~7 call sites
- TrainPipelineConfig: 10 profile_* fields reduced to 2 (mode + output_dir)
- update_policy: reverted to clean main-branch signature (no timing_collector)
- TrainingProfiler encapsulates torch profiler, timing collection,
deterministic forward artifacts, and all output writing
- CI script (run_model_profiling.py) unchanged—it only passes the 2 kept fields
Made-with: Cursor
Enable --policy.dtype=bfloat16 and --policy.gradient_checkpointing=true
for pi0, pi0_fast, and pi05 profiling specs. Combined with use_amp=true,
this brings the 4B-param VLA models well within the 22GB GPU budget.
Made-with: Cursor
- Move cudnn_deterministic to per-spec train_args instead of hardcoding
it for all models. cuBLAS deterministic mode triggers internal errors
on Gemma-based models (pi0, pi05) during backward pass.
- Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce memory
footprint from fp32 (~16GB weights alone) to bf16, fitting within
22GB GPU budget with room for activations and gradients.
- Small models (act, diffusion, multi_task_dit) still use deterministic
mode for reproducible profiling results.
Made-with: Cursor
Remove cProfile wrapping from the training loop and profiling utilities.
The torch profiler already captures fine-grained timing and operator
breakdowns; cProfile added redundant overhead without actionable
insight for GPU-bound models.
- Remove render_cprofile_summary, run_with_cprofile from profiling_utils
- Replace cProfile-wrapped calls in lerobot_train with direct calls
- Remove cprofile_summaries from artifact index in run_model_profiling
- Update tests to match
Made-with: Cursor
The previous commit moved these expressions from inline shell expansion
to job-level env: vars, but the profiling script runs inside a Docker
container. Job-level env vars are only visible in the runner, not inside
the container — they need explicit -e flags on the docker run command
(same pattern as HOST_GIT_COMMIT which was already forwarded).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-commit Quality gate flagged two issues:
1. ruff/isort: `from numbers import Real` must sort after
`from collections.abc import Callable` (stdlib alphabetical order).
2. zizmor (high): `github.head_ref`, `github.ref_name`,
`github.event.inputs.git_ref`, and `github.event.pull_request.head.sha`
were expanded directly in `run:` shell blocks, which zizmor flags as
attacker-controllable. Move all four into job-level `env:` vars
(GIT_REF, PR_NUMBER, HOST_GIT_COMMIT) so the shell only sees env-var
references — the same pattern the workflow already uses for
PROFILE_MODE, POLICY_FILTER, etc.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>