From 56068d37ea87274dd9e4b466663431bf5a707b1b Mon Sep 17 00:00:00 2001
From: Pepijn
Date: Fri, 15 May 2026 16:44:00 +0200
Subject: [PATCH] fix(smolvla2): default load_vlm_weights=True (don't train from scratch)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no
``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via
``SmolVLMForConditionalGeneration(config=...)``, i.e. a fully
**random-initialised** 500M backbone, including a random ``lm_head``.
For plain SmolVLA that's a deliberate "pre-train the expert" mode.

For SmolVLA2 it's a footgun: the high-level text head *is* the SmolVLM2
``lm_head``. Training subtask prediction from a random language model
can only memorise, which is exactly the repetition collapse seen on the
real robot ("the arm the arm the arm …").

SmolVLA2 now defaults ``load_vlm_weights=True`` so every run fine-tunes
the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct`` backbone
(vision tower + language model + ``lm_head``). The action expert still
trains from scratch on the robot data (standard SmolVLA fine-tuning).
To start the action expert from pretrained weights as well, fine-tune a
full ``lerobot/smolvla_base`` checkpoint via ``--policy.path``.

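A minimal sketch of the mechanism, assuming both configs are plain
Python dataclasses: redeclaring the field in the subclass changes its
default, while an explicit argument still overrides it. ``BaseConfig``,
``ChildConfig`` and ``backbone_init_mode`` are hypothetical stand-ins
for illustration, not the lerobot classes.

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    # Hypothetical stand-in for SmolVLAConfig: random-initialised VLM by default.
    load_vlm_weights: bool = False
    vlm_model_name: str = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"


@dataclass
class ChildConfig(BaseConfig):
    # Hypothetical stand-in for SmolVLA2Config: redeclaring the field
    # replaces the inherited default without touching the parent class.
    load_vlm_weights: bool = True


def backbone_init_mode(cfg: BaseConfig) -> str:
    """Report how the VLM backbone would be initialised for this config."""
    return "pretrained" if cfg.load_vlm_weights else "random"


if __name__ == "__main__":
    print(backbone_init_mode(BaseConfig()))
    print(backbone_init_mode(ChildConfig()))
    print(backbone_init_mode(ChildConfig(load_vlm_weights=False)))
```

In this sketch an explicit ``ChildConfig(load_vlm_weights=False)`` still
selects random initialisation; only the default changes.
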
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../smolvla2/configuration_smolvla2.py        | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/lerobot/policies/smolvla2/configuration_smolvla2.py b/src/lerobot/policies/smolvla2/configuration_smolvla2.py
index 8b7b1e5e8..39374f686 100644
--- a/src/lerobot/policies/smolvla2/configuration_smolvla2.py
+++ b/src/lerobot/policies/smolvla2/configuration_smolvla2.py
@@ -95,6 +95,25 @@ class SmolVLA2Config(SmolVLAConfig):
     effectively reduces SmolVLA2 back to SmolVLA's flow-only training,
     which is occasionally useful for ablations."""
 
+    load_vlm_weights: bool = True
+    """Load the pretrained SmolVLM2 backbone weights (vision tower +
+    language model + ``lm_head``) instead of random-initialising them.
+
+    ``SmolVLAConfig`` defaults this to ``False`` because the original
+    SmolVLA pre-training run trained the VLM body itself. For SmolVLA2
+    that default is a footgun: the text head **is** the SmolVLM2
+    ``lm_head``, and the high-level subtask supervision is hopeless if
+    it starts from a random language model: it can only memorise.
+    SmolVLA2 therefore defaults this to ``True`` so every run fine-tunes
+    from the pretrained ``vlm_model_name`` checkpoint
+    (``HuggingFaceTB/SmolVLM2-500M-Video-Instruct``).
+
+    Note this loads the *VLM backbone* pretrained; the action expert
+    still trains from scratch on the robot data (standard SmolVLA
+    fine-tuning). To also start the action expert from pretrained
+    weights, fine-tune from a full ``lerobot/smolvla_base`` checkpoint
+    via ``--policy.path``."""
+
     # Per-component prompt dropout (Pi0.7 §V.E) ---------------------------
     # At training, randomly drop non-target context messages whose
     # content was substituted from the named recipe binding. Forces