mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-21 19:49:49 +00:00)
fix(smolvla2): align flow_loss_weight default with Pi 0.5 paper's α=10
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text CE and flow MSE: actions are the primary output and the flow head should dominate the gradient signal. SmolVLA2 was defaulting both weights to 1.0, which inverts that: text CE (~0.5-2.0 nats) ends up larger than flow MSE (~0.1-1.0), so the action expert gets less gradient than the LM head despite being the primary task.

Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0. Same as ``pi052`` (the new full reproduction policy).

Also pin the values explicitly in the SLURM launcher so the choice is visible and overridable per-run rather than buried in the config default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
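The effect of the re-weighting can be checked with a little arithmetic. The sketch below is not code from this repository: the helper names `weighted_terms` and `flow_share` are hypothetical, and the loss magnitudes are illustrative values picked from the ranges quoted in the commit message. It compares only the share of the total loss value carried by the flow term, which is a rough proxy for gradient share, not a measurement of it.

```python
def weighted_terms(text_ce, flow_mse, text_loss_weight=1.0, flow_loss_weight=10.0):
    """Return (text_term, flow_term, total) for L = w_text * CE + w_flow * MSE."""
    text_term = text_loss_weight * text_ce
    flow_term = flow_loss_weight * flow_mse
    return text_term, flow_term, text_term + flow_term


def flow_share(text_ce, flow_mse, **weights):
    """Fraction of the total loss value contributed by the flow term."""
    _, flow_term, total = weighted_terms(text_ce, flow_mse, **weights)
    return flow_term / total


# Illustrative magnitudes (assumed): CE of 1.0 nats, flow MSE of 0.25.
text_ce, flow_mse = 1.0, 0.25

old = flow_share(text_ce, flow_mse, text_loss_weight=1.0, flow_loss_weight=1.0)
new = flow_share(text_ce, flow_mse, text_loss_weight=1.0, flow_loss_weight=10.0)

print(f"flow share at 1:1  = {old:.2f}")  # 0.25 / 1.25 -> 0.20
print(f"flow share at 1:10 = {new:.2f}")  # 2.50 / 3.50 -> 0.71
```

With these assumed magnitudes the flow term moves from about a fifth of the total loss to roughly 70% of it, which is the direction the commit message argues for.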
@@ -63,6 +63,8 @@ accelerate launch --multi_gpu --num_processes="$NUM_PROCESSES" \
 --policy.compile_model=false \
 --policy.device=cuda \
 --policy.tokenizer_max_length=512 \
+--policy.text_loss_weight=1.0 \
+--policy.flow_loss_weight=10.0 \
 --steps="$STEPS" \
 --policy.scheduler_decay_steps="$STEPS" \
 --batch_size="$BATCH_SIZE" \
@@ -69,12 +69,23 @@ class SmolVLA2Config(SmolVLAConfig):
     matches its training distribution."""
 
     # Loss weights --------------------------------------------------------
+    # Pi 0.5 paper §IV.D (Eq. 1) sets α = 10 between the text-CE term
+    # and the flow-MSE term: L = H(text) + α * ‖ω - a - f_θ‖². The
+    # rationale is that actions are the primary output and the flow
+    # head should dominate the gradient signal; text is supervised as
+    # an auxiliary task and its CE scale (~0.5-2.0 in nats) tends to
+    # be larger than the flow MSE scale (~0.1-1.0), so without
+    # up-weighting the action head gets starved. We mirror the paper's
+    # split here: text_loss_weight=1, flow_loss_weight=10.
     text_loss_weight: float = 1.0
     """Weight on the LM-head cross-entropy term. Set to ``0`` to disable
     text training entirely (reverts to flow-only / SmolVLA behaviour)."""
 
-    flow_loss_weight: float = 1.0
-    """Weight on the action-expert flow-matching term."""
+    flow_loss_weight: float = 10.0
+    """Weight on the action-expert flow-matching term. Default 10.0
+    matches Pi 0.5 paper's α (§IV.D). Set lower if the text head is
+    underfitting relative to the action expert; set higher if the
+    action expert is degrading because text loss dominates."""
 
     # Backbone training ---------------------------------------------------
     unfreeze_lm_head: bool = True
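As a rough illustration of how the two config fields interact, here is a minimal, self-contained sketch. `LossWeights` and `combine_losses` are hypothetical stand-ins, not the real `SmolVLA2Config` or the policy's forward pass; only the field names and default values are taken from the diff above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical stand-in for the two SmolVLA2Config fields in the diff.
    text_loss_weight: float = 1.0
    flow_loss_weight: float = 10.0


def combine_losses(text_ce: float, flow_mse: float, w: LossWeights) -> float:
    """Weighted sum of the LM-head CE term and the flow-matching term."""
    return w.text_loss_weight * text_ce + w.flow_loss_weight * flow_mse


defaults = LossWeights()
flow_only = replace(defaults, text_loss_weight=0.0)  # text training disabled
legacy = replace(defaults, flow_loss_weight=1.0)     # pre-commit 1:1 balance

print(combine_losses(1.0, 0.5, defaults))   # 1.0 + 5.0 = 6.0
print(combine_losses(1.0, 0.5, flow_only))  # 0.0 + 5.0 = 5.0
print(combine_losses(1.0, 0.5, legacy))     # 1.0 + 0.5 = 1.5
```

Keeping the weights as plain fields with defaults is what lets a launcher override them per run, as the `--policy.text_loss_weight` and `--policy.flow_loss_weight` flags do in the first hunk.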