Goal:
Create a high-level and a low-level policy. The high-level policy takes obs + text and returns text; it is a VLM.
The low-level policy is a VLA that takes the text returned by the high-level policy and outputs actions.


The high-level policy is run once per second, or whenever the user sends a prompt.
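
A minimal sketch of that trigger rule, for illustration only. The class name `HighLevelScheduler`, its fields, and the choice to restart the one-second timer after a user prompt are all assumptions, not something these notes specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HighLevelScheduler:
    """Decides when to re-query the high-level policy (hypothetical helper)."""

    period_s: float = 1.0               # re-run interval from the notes: one second
    last_run_s: Optional[float] = None  # time of the last high-level call, if any

    def should_run(self, now_s: float, new_user_prompt: bool) -> bool:
        # Due if the policy has never run or the period has elapsed.
        due = self.last_run_s is None or (now_s - self.last_run_s) >= self.period_s
        if new_user_prompt or due:
            # Assumption: a user prompt also restarts the periodic timer.
            self.last_run_s = now_s
            return True
        return False
```

Between high-level calls, the low-level policy would keep acting on the most recent text it received.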


Synthetic data generation:
D demo -> teleop data with a global task annotation
D label -> D demo segmented into short skills (one to three seconds each)
D syn -> p-gen creates the high-level prompts a user might give to p-hi
Given D label, prompt p-gen to imagine an appropriate user request. p-gen takes the images and ALL PRIOR skill labels in the episode: ℓ̂₀, …, ℓ̂ₜ₋₁

“Given the scene + all previous steps + current needed skill ℓ̂₅,
generate a user request that logically leads to ℓ̂₅.”
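
A rough sketch of how the text part of the p-gen prompt could be assembled from the prior skill labels and the current skill. The function name `build_pgen_prompt` and the exact wording of the template are hypothetical; the images would be passed to the VLM separately.

```python
from typing import Sequence


def build_pgen_prompt(prior_skills: Sequence[str], current_skill: str) -> str:
    """Build the text prompt for p-gen (hypothetical template).

    prior_skills: all skill labels already executed in the episode, in order.
    current_skill: the skill label the generated user request should lead to.
    """
    if prior_skills:
        history = "\n".join(f"{i}. {s}" for i, s in enumerate(prior_skills))
    else:
        history = "(none)"
    return (
        "You see the attached images of the scene.\n"
        f"Skills already executed in this episode:\n{history}\n"
        f"Skill needed now: {current_skill}\n"
        "Generate a user request that logically leads to the skill needed now."
    )
```

Each generated request, paired with its images and skill label, would then become one example in D syn.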

Train:
p-hi(ℓₜ | images, global label): cross-entropy loss, next-token prediction
p-low(Aₜ | images, qₜ, ℓₜ)
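
For reference, a small pure-Python sketch of the next-token cross-entropy loss used for the high-level policy: the mean negative log-likelihood of each target token under the softmax of the model's logits. The function name and the list-of-lists input format are assumptions for illustration; a real training loop would use a framework's built-in loss.

```python
import math
from typing import Sequence


def next_token_cross_entropy(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean cross-entropy over positions.

    logits[i] holds the unnormalized scores over the vocabulary at position i;
    targets[i] is the index of the correct next token at that position.
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equal length")
    total = 0.0
    for row, tgt in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tgt]  # -log softmax(row)[tgt]
    return total / len(targets)
```

With uniform logits over a vocabulary of size V, the loss equals log V, which is a handy sanity check.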