add multiple timesteps

2026-07-24 18:26:11 +00:00 · 2025-08-27 16:34:22 +02:00
parent 450be9d7d1
commit 2a901f8134
2 changed files with 9 additions and 8 deletions
@@ -609,7 +609,7 @@ def extract_visual_sequence(batch: dict[str, Tensor]) -> Tensor:
    # List of possible image keys to check, in order of preference
    possible_keys = [
        OBS_IMAGES,  # 'observation.images'
-        OBS_IMAGE,   # 'observation.image'  
+        OBS_IMAGE,  # 'observation.image'
        "observation.images.image",  # nested format from some datasets
    ]
@@ -43,6 +43,7 @@ Eval:
 Implement something like a VOC score, or a rank-order correlation between the learned reward and the env reward from sim, or use something else for additional evaluation
 Ideas:
 - Incorporate training on multiple horizons, i.e. label the same dataset at several horizons: "make a sandwich" (long), "put cheese on bread" (medium), and even smaller horizons such as "go down" or "close gripper" (small)
 - Incorporate navigation goals such as "walk towards the kitchen". Make sure we fix the CLIP contrastive learning issue of positional text misunderstanding, where the model doesn't learn the difference between "horse right of cow" and "horse left of cow" (relevant for commands like "move right"). Potentially train with more other data or even actionable world models such as Genie 3 (https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/)
@@ -123,9 +124,9 @@ Default weights: $\lambda_{\text{prog}}=1.0$, $\lambda_{\text{spatial-nce}}=0.5$
 - Implement an eval score or metric that is robust, handles generalization, and is a good metric for comparing different architectures. Use it in an eval Jupyter notebook with visualization of the live reward next to the video for part of the dataset: VOC score and score with correct and incorrect language captions [x]
 - Do first training [x]
 - Try different losses []
-    - Only vlc loss then eval []
+  - Only vlc loss then eval []
-    - Only rewind loss then eval []
+  - Only rewind loss then eval []
-    - Vlc + rewind loss then eval []
+  - Vlc + rewind loss then eval []
 - Cleanup code
 - Switch to DINO v3 as encoder Base 86 M: https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m with HuggingFaceTB/SmolLM2-135M-Instruct ?
 - Add more artificial text to dataset generated by vlm (google gemini) []