🐛 fix v30_to_v21 ArrowTypeError on pandas extension dtypes

`table.slice(...).to_pandas()` produces pandas ExtensionArrays for
`array[float32]` columns (e.g. `observation.states.end.orientation`)
on newer pandas/pyarrow combos, which then fail in
`pa.Table.from_pandas` inside `Dataset.from_pandas(...).to_parquet(...)`.

Skip the pandas round-trip and wrap the `pa.Table` slice in a
`Dataset` directly with `Dataset(episode_table).to_parquet(...)`.
This preserves the HuggingFace dataset metadata that `Dataset.to_parquet`
writes, while avoiding the ExtensionArray crash. No version pin on
datasets/pyarrow needed.

Closes #87
Author: FennMai
Date: 2026-04-30 07:03:03 +00:00
Parent: 8aa7343137
Commit: 723bd71cf2
@@ -181,7 +181,7 @@ def convert_data(root: Path, new_root: Path, episode_records: list[dict[str, Any
             f"episode_index={episode_index}, length={length}"
         )
-        episode_table = table.slice(start, length).to_pandas()
+        episode_table = table.slice(start, length)
         dest_chunk = episode_index // DEFAULT_CHUNK_SIZE
         dest_path = new_root / LEGACY_DATA_PATH_TEMPLATE.format(
@@ -189,7 +189,7 @@ def convert_data(root: Path, new_root: Path, episode_records: list[dict[str, Any
             episode_index=episode_index,
         )
         dest_path.parent.mkdir(parents=True, exist_ok=True)
-        Dataset.from_pandas(episode_table).to_parquet(dest_path)
+        Dataset(episode_table).to_parquet(dest_path)
 def _group_episodes_by_video_file(