Megatron-DeepSpeed
distributed merge of per-rank Megatron data files
#55
Merged

adammoody
adammoody add parallel merge using mpi
269af4eb
adammoody handle case where some ranks might have 0 items
9ba081be
adammoody add inclusive scan prefix sum
d29a7023
adammoody report more timing info
ed497132
adammoody force pushed from 488f440a to ed497132 4 years ago
adammoody
thomasw21
thomasw21 commented on 2021-08-12
adammoody Update megatron/data/indexed_dataset.py
e94f2a0c
adammoody Update megatron/data/indexed_dataset.py
687ff32f
adammoody rename total size variable for clarity
af595454
adammoody changed the title from "MPI-based parallel merge of per-rank Megatron data files" to "WIP: MPI-based parallel merge of per-rank Megatron data files" 4 years ago
adammoody
adammoody move translation to bin/idx file names a level deeper
4f648a0a
adammoody parallel merge for cached dataset
9f2ba6ae
adammoody add alltrue function
72d6c9c2
adammoody move collectives to new distdata class, add torch.distributed
8b67becb
adammoody drop unused prefix_sum function
3eca1f35
adammoody
adammoody allow ranks to pass a list of files to be merged
a691b481
adammoody check that input dataset files exist
e4a34e2a
adammoody fix: using wrong doc_idx list for mmap
8b168cab
adammoody move init dist and collectives to distdata class
7a026938
thomasw21
thomasw21 commented on 2021-08-16
adammoody
adammoody
adammoody add --merge option, move parallel/serial to their own functions
eca2940f
adammoody
adammoody Merge branch 'main' into pmerge
b14491df
adammoody Update megatron/data/distdata.py
ec11281f
thomasw21
adammoody Update megatron/data/indexed_dataset.py
354d13bd
thomasw21 assigned adammoody 4 years ago
thomasw21 added the enhancement label
adammoody Update megatron/data/indexed_dataset.py
2dc3f7ad
adammoody Update megatron/data/indexed_dataset.py
980e9043
adammoody Update megatron/data/indexed_dataset.py
ebd20a6f
adammoody Update megatron/data/indexed_dataset.py
69b2f49b
adammoody Update megatron/data/indexed_dataset.py
50de06ac
adammoody drop extraneous numpy tolist calls
af290ad9
adammoody rename self.MPI to mpi4py
4b58c74c
adammoody handle case where no ranks have elements in their file
71a2fdcf
adammoody rename tokenize_start to time_start
73d3a247
adammoody drop unrelated comment in distdata.min
b9e69bea
adammoody add comment why pointers_shift is not None and add assert
da615c6d
adammoody note why pointers uses sizes count and offset values
c42f41f5
adammoody can just rely on rank 0 for the leading 0 element
a3a7d539
adammoody add write_list function
163310aa
adammoody determine element size
01b2be07
adammoody add checks for consistent element_size values
4b6e8ffa
adammoody check that at least one rank has a file to merge
ea085555
adammoody changed the title from "WIP: MPI-based parallel merge of per-rank Megatron data files" to "WIP: parallel merge of per-rank Megatron data files" 4 years ago
adammoody assert that torch backend is gloo or mpi
2524fce6
adammoody add collectives for assert and raise
ca14d48d
adammoody rename to allassert and allraise_if
d482f36f
adammoody check dtype instead of element_size
28d76f57
adammoody add uint32 to element_sizes table
f706108a
adammoody infer dtype from files being merged
f1228837
adammoody add write_header function to indexed dataset classes
57c012e0
adammoody call write_header internally from IndexedDataset classes
eed83271
adammoody
thomasw21
adammoody return number of bytes written from write calls
a75cfc2c
adammoody Merge branch 'main' into pmerge
afcfcf95
adammoody move scatterv to distdata class
74b733a4
adammoody add functions to format status and error messages
dadb51b4
adammoody defer merge_files_dist to future PR
a2f8fa0f
adammoody open files using with, refresh comments
39e6cd74
adammoody rely on default torch datatypes
2a29d996
adammoody fix some status messages from preprocess script
d6fa8959
adammoody fix: exclusive scan computing pointers list
1216c0ab
adammoody Merge branch 'pointerfix' into pmerge
a64d3dab
adammoody fix: exclusive scan to compute mmap pointers list
fde439ec
adammoody note about seek
ba14351e
adammoody rename preprocess_dataset_mpi.py to preprocess_data_dist.py
852fdd0c
adammoody update usage comments at top of script
61f4b467
adammoody
adammoody restore commented print_rank_0 statements
22400f37
adammoody restore status message in mmap merge_file_
5cfcb955
adammoody drop mpi4py, sad :(
74c48831
adammoody Merge branch 'main' into pmerge
373e5145
adammoody add test case for parallel merge
78ab7158
adammoody add preprocess_data_dist test for serial merge
002b4032
adammoody
adammoody changed the title from "WIP: parallel merge of per-rank Megatron data files" to "parallel merge of per-rank Megatron data files" 4 years ago
adammoody changed the title from "parallel merge of per-rank Megatron data files" to "distributed merge of per-rank Megatron data files" 4 years ago
thomasw21
thomasw21 commented on 2021-08-19
stas00
adammoody improve error handling
ba763f7c
stas00
adammoody refactor get_pointers code
fa111591
adammoody bug fix in exscan
7e53fd34
adammoody further refactor get_pointers
53df36f4
adammoody move exscan collective for pointers outside of try block
c43348ff
adammoody clarify some comments
81c21dd5
adammoody include string 1k in name of test files
adee502b
adammoody use temporary file for index
13ae421d
thomasw21
adammoody
adammoody fix: implement scatterv from torch.distributed.scatter
f3e1b1dc
adammoody
thomasw21
thomasw21 commented on 2021-08-23
thomasw21
thomasw21 commented on 2021-08-24
thomasw21
thomasw21 commented on 2021-08-24
adammoody switch to pad method in torch.nn.functional
42962e1b
adammoody return data received in scatterv as new tensor
9a2f3838
adammoody raise exception if conflicting scratch and merge options
15b7603a
adammoody use allraise method from distdata in preprocess_data_dist
4adaddd7
thomasw21
thomasw21 commented on 2021-08-25
adammoody
thomasw21
adammoody
thomasw21
adammoody
adammoody
thomasw21
thomasw21
thomasw21 merged 97221116 into main 4 years ago
stas00
adammoody
adammoody deleted the pmerge branch 4 years ago
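
Note on the "exclusive scan" commits in the timeline above: several of them concern computing the pointers list (byte offsets) for a merged mmap index. The sketch below illustrates the general idea only and is not the code from this PR. It assumes each rank holds a list of per-item element counts, that every element occupies `itemsize` bytes, and that a rank's starting byte offset is the total byte count of all lower ranks. The names `exclusive_scan`, `local_pointers`, and `simulate_ranks` are hypothetical, and the cross-rank collective is simulated in a single process rather than performed with torch.distributed.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0:i], so out[0] is 0."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def local_pointers(sizes, itemsize, byte_offset):
    """Byte offset of each local item, shifted by the bytes held on lower ranks."""
    byte_counts = [count * itemsize for count in sizes]
    return [byte_offset + p for p in exclusive_scan(byte_counts)]


def simulate_ranks(sizes_per_rank, itemsize):
    """Single-process stand-in for the distributed case (assumption: no real collective).

    The scan over per-rank byte totals plays the role of an exclusive-scan
    collective. Ranks with zero items contribute 0 bytes and get an empty list.
    """
    totals = [sum(sizes) * itemsize for sizes in sizes_per_rank]
    offsets = exclusive_scan(totals)
    return [
        local_pointers(sizes, itemsize, offset)
        for sizes, offset in zip(sizes_per_rank, offsets)
    ]


if __name__ == "__main__":
    # Three ranks; the middle rank has no items.
    print(simulate_ranks([[3, 5], [], [2]], itemsize=2))
```

Concatenating the per-rank lists in rank order gives the same pointers a serial merge of all items would produce, which is the property a distributed merge depends on. A rank with no items simply contributes nothing.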
