WIP: distributed terashuf #92

adammoody wants to merge 81 commits into bigscience-workshop:main from adammoody:distshuf
adammoody add parallel merge using mpi
269af4eb
adammoody handle case where some ranks might have 0 items
9ba081be
adammoody add inclusive scan prefix sum
d29a7023
adammoody report more timing info
ed497132
adammoody Update megatron/data/indexed_dataset.py
e94f2a0c
adammoody Update megatron/data/indexed_dataset.py
687ff32f
adammoody rename total size variable for clarity
af595454
adammoody move translation to bin/idx file names a level deeper
4f648a0a
adammoody parallel merge for cached dataset
9f2ba6ae
adammoody add alltrue function
72d6c9c2
adammoody move collectives to new distdata class, add torch.distributed
8b67becb
adammoody drop unused prefix_sum function
3eca1f35
adammoody allow ranks to pass a list of files to be merged
a691b481
adammoody check that input dataset files exist
e4a34e2a
adammoody fix: using wrong doc_idx list for mmap
8b168cab
adammoody move init dist and collectives to distdata class
7a026938
adammoody add --merge option, move parallel/serial to their own functions
eca2940f
adammoody Merge branch 'main' into pmerge
b14491df
adammoody Update megatron/data/distdata.py
ec11281f
adammoody Update megatron/data/indexed_dataset.py
354d13bd
adammoody Update megatron/data/indexed_dataset.py
2dc3f7ad
adammoody Update megatron/data/indexed_dataset.py
980e9043
adammoody Update megatron/data/indexed_dataset.py
ebd20a6f
adammoody Update megatron/data/indexed_dataset.py
69b2f49b
adammoody Update megatron/data/indexed_dataset.py
50de06ac
adammoody drop extraneous numpy tolist calls
af290ad9
adammoody rename self.MPI to mpi4py
4b58c74c
adammoody handle case where no ranks have elements in their file
71a2fdcf
adammoody rename tokenize_start to time_start
73d3a247
adammoody drop unrelated comment in distdata.min
b9e69bea
adammoody add comment why pointers_shift is not None and add assert
da615c6d
adammoody note why pointers uses sizes count and offset values
c42f41f5
adammoody can just rely on rank 0 for the leading 0 element
a3a7d539
adammoody add write_list function
163310aa
adammoody determine element size
01b2be07
adammoody add checks for consistent element_size values
4b6e8ffa
adammoody check that at least one rank has a file to merge
ea085555
adammoody assert that torch backend is gloo or mpi
2524fce6
adammoody add collectives for assert and raise
ca14d48d
adammoody rename to allassert and allraise_if
d482f36f
adammoody check dtype instead of element_size
28d76f57
adammoody add uint32 to element_sizes table
f706108a
adammoody infer dtype from files being merged
f1228837
adammoody add write_header function to indexed dataset classes
57c012e0
adammoody call write_header internally from IndexedDataset classes
eed83271
adammoody return number of bytes written from write calls
a75cfc2c
adammoody Merge branch 'main' into pmerge
afcfcf95
adammoody move scatterv to distdata class
74b733a4
adammoody add functions to format status and error messages
dadb51b4
adammoody defer merge_files_dist to future PR
a2f8fa0f
adammoody open files using with, refresh comments
39e6cd74
adammoody rely on default torch datatypes
2a29d996
adammoody fix some status messages from preprocess script
d6fa8959
adammoody fix: exclusive scan computing pointers list
1216c0ab
adammoody Merge branch 'pointerfix' into pmerge
a64d3dab
adammoody fix: exclusive scan to compute mmap pointers list
fde439ec
adammoody abstraction to index and randomly access jsonl files
fb274bfe
adammoody rebase on parallel merge, replace mpi4py with distdata class
d428c025
adammoody note about seek
ba14351e
adammoody rename preprocess_dataset_mpi.py to preprocess_data_dist.py
852fdd0c
adammoody update usage comments at top of script
61f4b467
adammoody Merge branch 'pmerge' into mpijson
18881ae0
adammoody look for extension .jsonl
bd6f41fb
adammoody add progress messages
3488d0bc
adammoody rebuild index if mtime is old
1305fe93
adammoody store index values in network byte order
6bcac1fd
adammoody add magic value and format version number to index file
813d0683
adammoody Merge branch 'main' into mpijson
0510081b
adammoody clean up merge
1fea302d
adammoody clean up merge
d3603130
adammoody pass distctx instead of mpi_comm to IndexedJSON
20a43afe
adammoody move existence test and stat queries to distdata
7b083479
adammoody add exception handling
8d448bce
adammoody edit typos in comments
6f7519f0
adammoody close shared file if open fails on any rank
3f9078d4
adammoody add distributed shuffle
fbd38bfc
adammoody shuffle on each rank to keep rng in step
927fbc17
adammoody optimize broadcast and sample ident steps with numpy
d243847f
adammoody add timer for global shuffle step
6d7dd4b8
adammoody generate random seed if not specified
6bc9b943
adammoody add function to concatenate files
0b4a4cad
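
Several commits above ("fix: exclusive scan computing pointers list", "note why pointers uses sizes count and offset values") concern computing byte pointers from per-item sizes. As a rough illustration only, and not the code from this PR, an exclusive scan over sizes could look like the sketch below. The function name `exclusive_scan_pointers` and its parameters are hypothetical; the `offset` argument stands in for the bytes written by preceding ranks, which is an assumption about how the merge uses it.

```python
import numpy as np

def exclusive_scan_pointers(sizes, element_size, offset=0):
    """Hypothetical helper: byte offset of each item from per-item element counts.

    An exclusive scan means item i's pointer is the sum of the byte sizes of
    items 0..i-1, so the first pointer is always `offset`, not sizes[0].
    """
    sizes = np.asarray(sizes, dtype=np.int64)
    pointers = np.zeros(len(sizes), dtype=np.int64)
    if len(sizes) > 1:
        # Sum the byte sizes of all items before each position.
        pointers[1:] = np.cumsum(sizes[:-1] * element_size)
    # Shift by the assumed starting byte offset (e.g. data from earlier ranks).
    return pointers + offset
```

An inclusive scan (`np.cumsum(sizes * element_size)`) would instead give the end of each item, which shifts every pointer by one item; that off-by-one is the kind of error the "fix: exclusive scan" commit titles suggest.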

