- wall_x: switch to the SGD optimizer and add explicit scheduler
overrides. The 4B-param model casts to bf16 internally, but AdamW's
exp_avg/exp_avg_sq states blow past the 22 GB GPU budget. Same fix we
applied to pi0/pi05/pi0_fast.
- xvla: fix rename_map. The dataset (libero_plus) exposes front/wrist
image keys while the model expects image/image2. The previous map was
inverted and left the batch without any recognized image feature; see
the sketch below.
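A minimal sketch of the corrected mapping direction (the dataset-side
key names below are assumptions; only the image/image2 targets come
from the model):

    # Corrected direction: dataset key -> model-expected key.
    rename_map = {
        "observation.images.front": "image",
        "observation.images.wrist": "image2",
    }

    def rename_batch(batch, rename_map):
        # Apply the mapping so the policy sees the feature names it expects.
        return {rename_map.get(k, k): v for k, v in batch.items()}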
Made-with: Cursor
Adam optimizer states (exp_avg + exp_avg_sq) require ~16 GB on top of
model params and gradients for 4B-parameter models, exceeding the
22 GB GPU budget. SGD carries no optimizer-state overhead, and since
profiling only measures forward/backward timing, the optimizer choice
does not affect the results.
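A rough sketch of the swap (function and argument names are
illustrative, not the actual profiling-spec plumbing):

    import torch

    def build_optimizer(model, use_sgd=True, lr=1e-4):
        if use_sgd:
            # Plain SGD (momentum=0) keeps no per-parameter state, so memory
            # stays at params + grads; fine when we only time fwd/bwd.
            return torch.optim.SGD(model.parameters(), lr=lr)
        # AdamW allocates exp_avg and exp_avg_sq for every parameter, which
        # is what pushes a 4B-param model past the 22 GB budget.
        return torch.optim.AdamW(model.parameters(), lr=lr)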
Also adds torch.cuda.empty_cache() after the deterministic forward
pass to release transient memory before the training loop starts.
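Roughly where the call sits (the policy/batch names are illustrative):

    import torch

    def warmup_and_release(policy, batch):
        # Run the deterministic forward once, then return transient
        # allocations to the allocator before the timed training loop.
        with torch.no_grad():
            _ = policy(batch)
        torch.cuda.empty_cache()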
Made-with: Cursor
Enable --policy.dtype=bfloat16 and --policy.gradient_checkpointing=true
for the pi0, pi0_fast, and pi05 profiling specs. Combined with
use_amp=true, this brings the 4B-param VLA models well within the
22 GB GPU budget.
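A sketch of what the two flags amount to on the policy side (assuming
an HF-style gradient-checkpointing hook; the real wiring lives in the
policy config):

    import torch

    def apply_memory_savers(policy):
        # --policy.dtype=bfloat16: hold the weights in bf16 (~half of fp32).
        policy = policy.to(dtype=torch.bfloat16)
        # --policy.gradient_checkpointing=true: recompute activations during
        # backward instead of storing them, trading compute for memory.
        if hasattr(policy, "gradient_checkpointing_enable"):
            policy.gradient_checkpointing_enable()
        return policy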
Made-with: Cursor
- Move cudnn_deterministic to per-spec train_args instead of hardcoding
it for all models. cuBLAS deterministic mode triggers internal errors
on Gemma-based models (pi0, pi05) during the backward pass; see the
sketch after this list.
- Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce the memory
footprint from fp32 (~16 GB of weights alone) to bf16, fitting within
the 22 GB GPU budget with room left for activations and gradients.
- Small models (act, diffusion, multi_task_dit) still use deterministic
mode for reproducible profiling results.
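A sketch of the intended per-spec behaviour (field names and the loss
access are assumptions):

    import torch

    def configure_determinism(train_args):
        # Only specs that opt in (act, diffusion, multi_task_dit) request
        # deterministic kernels; Gemma-based policies hit cuBLAS errors in
        # backward when deterministic mode is on.
        if train_args.get("cudnn_deterministic", False):
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False

    def amp_step(policy, batch):
        # use_amp=true: run forward/backward under bf16 autocast so
        # activations and matmuls stay in bf16 rather than fp32.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = policy(batch)["loss"]
        loss.backward()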
Made-with: Cursor