distributed merge of per-rank Megatron data files #55
add parallel merge using mpi
269af4eb
handle case where some ranks might have 0 items
9ba081be
add inclusive scan prefix sum
d29a7023
report more timing info
ed497132
adammoody force-pushed from 488f440a to ed497132 4 years ago
Update megatron/data/indexed_dataset.py
e94f2a0c
Update megatron/data/indexed_dataset.py
687ff32f
rename total size variable for clarity
af595454
adammoody changed the title from "MPI-based parallel merge of per-rank Megatron data files" to "WIP: MPI-based parallel merge of per-rank Megatron data files" 4 years ago
move translation to bin/idx file names a level deeper
4f648a0a
parallel merge for cached dataset
9f2ba6ae
add alltrue function
72d6c9c2
move collectives to new distdata class, add torch.distributed
8b67becb
drop unused prefix_sum function
3eca1f35
allow ranks to pass a list of files to be merged
a691b481
check that input dataset files exist
e4a34e2a
fix: using wrong doc_idx list for mmap
8b168cab
move init dist and collectives to distdata class
7a026938
add --merge option, move parallel/serial to their own functions
eca2940f
Merge branch 'main' into pmerge
b14491df
Update megatron/data/distdata.py
ec11281f
Update megatron/data/indexed_dataset.py
354d13bd
Update megatron/data/indexed_dataset.py
2dc3f7ad
Update megatron/data/indexed_dataset.py
980e9043
Update megatron/data/indexed_dataset.py
ebd20a6f
Update megatron/data/indexed_dataset.py
69b2f49b
Update megatron/data/indexed_dataset.py
50de06ac
drop extraneous numpy tolist calls
af290ad9
rename self.MPI to mpi4py
4b58c74c
handle case where no ranks have elements in their file
71a2fdcf
rename tokenize_start to time_start
73d3a247
drop unrelated comment in distdata.min
b9e69bea
add comment why pointers_shift is not None and add assert
da615c6d
note why pointers uses sizes count and offset values
c42f41f5
can just rely on rank 0 for the leading 0 element
a3a7d539
add write_list function
163310aa
determine element size
01b2be07
add checks for consistent element_size values
4b6e8ffa
check that at least one rank has a file to merge
ea085555
adammoody changed the title from "WIP: MPI-based parallel merge of per-rank Megatron data files" to "WIP: parallel merge of per-rank Megatron data files" 4 years ago
assert that torch backend is gloo or mpi
2524fce6
add collectives for assert and raise
ca14d48d
rename to allassert and allraise_if
d482f36f
check dtype instead of element_size
28d76f57
add uint32 to element_sizes table
f706108a
infer dtype from files being merged
f1228837
add write_header function to indexed dataset classes
57c012e0
call write_header internally from IndexedDataset classes
eed83271
return number of bytes written from write calls
a75cfc2c
Merge branch 'main' into pmerge
afcfcf95
move scatterv to distdata class
74b733a4
add functions to format status and error messages
dadb51b4
defer merge_files_dist to future PR
a2f8fa0f
open files using with, refresh comments
39e6cd74
rely on default torch datatypes
2a29d996
fix some status messages from preprocess script
d6fa8959
fix: exclusive scan computing pointers list
1216c0ab
Merge branch 'pointerfix' into pmerge
a64d3dab
fix: exclusive scan to compute mmap pointers list
fde439ec
note about seek
ba14351e
rename preprocess_dataset_mpi.py to preprocess_data_dist.py
852fdd0c
update usage comments at top of script
61f4b467
restore commented print_rank_0 statements
22400f37
restore status message in mmap merge_file_
5cfcb955
drop mpi4py, sad :(
74c48831
Merge branch 'main' into pmerge
373e5145
add test case for parallel merge
78ab7158
add preprocess_data_dist test for serial merge
002b4032
adammoody changed the title from "WIP: parallel merge of per-rank Megatron data files" to "parallel merge of per-rank Megatron data files" 4 years ago
adammoody changed the title from "parallel merge of per-rank Megatron data files" to "distributed merge of per-rank Megatron data files" 4 years ago
improve error handling
ba763f7c
refactor get_pointers code
fa111591
bug fix in exscan
7e53fd34
further refactor get_pointers
53df36f4
move exscan collective for pointers outside of try block
c43348ff
clarify some comments
81c21dd5
include string 1k in name of test files
adee502b
use temporary file for index
13ae421d
fix: implement scatterv from torch.distributed.scatter
f3e1b1dc
switch to pad method in torch.nn.functional
42962e1b
return data received in scatterv as new tensor
9a2f3838
raise exception if conflicting scratch and merge options
15b7603a
use allraise method from distdata in preprocess_data_dist
4adaddd7
thomasw21 merged 97221116 into main 4 years ago
adammoody deleted the pmerge branch 4 years ago
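
Several commits above concern computing the mmap pointers list with an exclusive scan, including the case where some ranks have 0 items. The following is a minimal sketch of that idea, not the code from this PR: the function names `exclusive_scan` and `distributed_pointers` are hypothetical, and the ranks are simulated with a plain Python list instead of a `torch.distributed` collective. It assumes each pointer is the byte offset of an item, so it equals the element size times the number of elements that precede the item across all ranks.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Return the running sum that excludes each element itself.

    For [a, b, c] this gives [0, a, a + b]; an empty input gives [].
    """
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def distributed_pointers(rank_sizes, element_size):
    """Compute byte offsets for every item, one list per simulated rank.

    rank_sizes[r] holds the item sizes (in elements) owned by rank r.
    A rank with no items contributes a total of 0 and gets an empty list.
    """
    # Step 1: each rank's total element count, then an exclusive scan
    # across ranks to find where that rank's data starts in the merged file.
    totals = [sum(sizes) for sizes in rank_sizes]
    rank_starts = exclusive_scan(totals)

    # Step 2: a local exclusive scan within each rank, shifted by the
    # rank's starting offset and converted from elements to bytes.
    result = []
    for start, sizes in zip(rank_starts, rank_sizes):
        local = exclusive_scan(sizes)
        result.append([(start + off) * element_size for off in local])
    return result


if __name__ == "__main__":
    # Rank 1 has no items; 2-byte elements (for example uint16 tokens).
    print(distributed_pointers([[3, 5], [], [2]], element_size=2))
```

Concatenating the per-rank lists should match a serial exclusive scan over all sizes in rank order, which is a convenient consistency check for a merged index.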