Select only supervised text and FAST action-code positions before cross-entropy to avoid full-vocabulary loss tensors over padded sequences.
Co-authored-by: Cursor <cursoragent@cursor.com>
Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training.
Co-authored-by: Cursor <cursoragent@cursor.com>