mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-15 08:39:49 +00:00
add multiple timesteps
This commit is contained in:
@@ -605,11 +605,11 @@ def generate_causal_mask(T: int, device=None) -> Tensor:
|
||||
def extract_visual_sequence(batch: dict[str, Tensor]) -> Tensor:
|
||||
# Accept various image key formats from datasets
|
||||
# With delta_indices, the dataset provides temporal sequences automatically
|
||||
|
||||
|
||||
# List of possible image keys to check, in order of preference
|
||||
possible_keys = [
|
||||
OBS_IMAGES, # 'observation.images'
|
||||
OBS_IMAGE, # 'observation.image'
|
||||
OBS_IMAGE, # 'observation.image'
|
||||
"observation.images.image", # nested format from some datasets
|
||||
]
|
||||
|
||||
@@ -692,11 +692,11 @@ def pairwise_ranking_loss(logits: Tensor, target: Tensor, margin: float = 0.1, n
|
||||
|
||||
def zscore(x: Tensor, eps: float = 1e-3) -> Tensor:
|
||||
"""Z-score normalization with numerical stability.
|
||||
|
||||
|
||||
Args:
|
||||
x: Tensor of shape (B, T) where B is batch size, T is sequence length
|
||||
eps: Small epsilon for numerical stability
|
||||
|
||||
|
||||
Returns:
|
||||
Z-scored tensor of same shape as input
|
||||
"""
|
||||
@@ -709,7 +709,7 @@ def zscore(x: Tensor, eps: float = 1e-3) -> Tensor:
|
||||
if T == 1:
|
||||
# Single timestep: use tanh to bound values instead of z-score
|
||||
return torch.tanh(x * 0.1)
|
||||
|
||||
|
||||
# Multiple timesteps: compute z-score across time dimension for each batch
|
||||
mean = x.mean(dim=1, keepdim=True) # (B, 1)
|
||||
std = x.std(dim=1, keepdim=True, unbiased=False) # (B, 1)
|
||||
|
||||
@@ -43,6 +43,7 @@ Eval:
|
||||
Implement something like VOC score, or ROC rank-order correlation between the learned reward and the env reward from sim, or use something else to do additional evaluation
|
||||
|
||||
Ideas:
|
||||
|
||||
- Incorporate training on multiple horizons: as in label same dataset for longer horizons: make a sandwich (long), put cheese on bread (medium) and even smaller horizons: go down or close gripper (small)
|
||||
- Incorporate navigation goals “walk towards the kitchen”; make sure we fix the CLIP contrastive-learning issue of positional text misunderstanding, where the model doesn't learn the difference between "horse right of cow" and "horse left of cow", or “Move right” — potentially train with more other data or even actionable world models such as Genie 3 (https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/)
|
||||
|
||||
@@ -123,9 +124,9 @@ Default weights: $\lambda_{\text{prog}}=1.0$, $\lambda_{\text{spatial-nce}}=0.5$
|
||||
- Implement eval score or metric that is robust and can deal with generalization/is a good metric to try different architectures. And use it in an eval Jupyter notebook with visualization of the live reward next to the video for part of the dataset: VOC score and score with correct and incorrect language captions [x]
|
||||
- Do first training [x]
|
||||
- Try different losses []
|
||||
- Only vlc loss then eval []
|
||||
- Only rewind loss then eval []
|
||||
- Vlc + rewind loss then eval []
|
||||
- Only vlc loss then eval []
|
||||
- Only rewind loss then eval []
|
||||
- Vlc + rewind loss then eval []
|
||||
- Cleanup code
|
||||
- Switch to DINO v3 as encoder Base 86 M: https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m with HuggingFaceTB/SmolLM2-135M-Instruct ?
|
||||
- Add more artificial text to dataset generated by vlm (google gemini) []
|
||||
|
||||
Reference in New Issue
Block a user