lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-06 17:41:47 +00:00

Files

T

Pepijn 5bb2da4da6 fix(pi052): VQA target format = "label <loc><loc>" not "<loc><loc> label"

The trained model collapsed to spewing 40+ <loc> tokens for *every*
prompt — subtask, memory, anything — because VQA targets were supervised
to *start* with <loc>. With ~25% of all text samples beginning with a
<loc> token, the LM head learned "Assistant: → <loc>" as a strong
attractor; once one loc is emitted, autoregression chains the rest.

Flip the format so every text target — subtask, memory, speech, AND VQA
— starts with a regular word. The model still learns the <loc>
vocabulary for the spatial portion of the answer, but loc can no
longer be the first generation step out of a clean prompt.

Examples:
  point  : "green box <loc0162><loc0759>"
  bbox   : "cube <loc0082>…<loc0409>"
  multi  : "blue <locs> ; yellow <locs>"

The runtime parser (parse_loc_answer) strips loc tokens and uses the
remainder as label, so it's order-tolerant and works under either
format. Old loc-first checkpoints still parse cleanly at inference;
new training will use label-first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 18:56:48 +02:00

annotations

revert(annotate): move memory + speech prompts to base PR (#3471 )

2026-05-19 14:17:52 +02:00

async_inference

feat(dependencies): minimal default tag install (#3362 )

2026-04-12 20:03:04 +02:00

cameras

refactor(imports): enforce guard pattern (#3382 )

2026-04-14 22:54:05 +02:00

common

refactor(imports): enforce guard pattern (#3382 )