Commit Graph

8 Commits

Author SHA1 Message Date
Pepijn 41166b39fb fix(train): synchronize EpisodeAwareSampler shuffling across ranks and gate dataset download per node (#3768)
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync

In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.

* fix(train): seed sampler generator and gate dataset download per node

- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
  accelerator.prepare registers it as the synchronized RNG and the
  shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
  is_main_process: the global main process only exists on node 0, so on
  every other node all local ranks were downloading the dataset and
  building the Arrow cache concurrently.
2026-06-11 11:07:42 +02:00
Steven Palma df0763a2bc feat(dependencies): minimal default tag install (#3362) 2026-04-12 20:03:04 +02:00
Steven Palma d90e4bcfd3 refactor(dataset): modular files (#3171)
* refactor(dataset): modular files

* refactor(dataset): update imports across the codebase
2026-03-15 23:58:09 -07:00
Steven Palma 9d3b62aa61 chore(dataset): basic house-keeping (#3170) 2026-03-15 22:12:09 -07:00
Steven Palma 7c2ec31793 refactor(datasets): module cleanup (#3169) 2026-03-15 20:42:15 -07:00
Michel Aractingi f55c6e89f0 Dataset v3 (#1412)
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Tavish <tavish9.chen@gmail.com>
Co-authored-by: fracapuano <francesco.capuano@huggingface.co>
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>
2025-09-15 09:53:30 +02:00
Simon Alibert d4ee470b00 Package folder structure (#1417)
* Move files

* Replace imports & paths

* Update relative paths

* Update doc symlinks

* Update instructions paths

* Fix imports

* Update grpc files

* Update more instructions

* Downgrade grpc-tools

* Update manifest

* Update more paths

* Update config paths

* Update CI paths

* Update bandit exclusions

* Remove walkthrough section
2025-07-01 16:34:46 +02:00
Simon Alibert 974028bd28 Organize test folders (#856)
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2025-03-13 14:05:55 +01:00