PR #55 distributed merge of per-rank Megatron data files

add parallel merge using mpi

269af4eb

handle case where some ranks might have 0 items

9ba081be

add inclusive scan prefix sum

d29a7023

report more timing info

ed497132

adammoody force-pushed from 488f440a to ed497132 4 years ago

thomasw21 commented on 2021-08-12

Update megatron/data/indexed_dataset.py

e94f2a0c

Update megatron/data/indexed_dataset.py

687ff32f

rename total size variable for clarity

af595454

adammoody changed the title ~~MPI-based parallel merge of per-rank Megatron data files~~ WIP: MPI-based parallel merge of per-rank Megatron data files 4 years ago

move translation to bin/idx file names a level deeper

4f648a0a

parallel merge for cached dataset

9f2ba6ae

add alltrue function

72d6c9c2

move collectives to new distdata class, add torch.distributed

8b67becb

drop unused prefix_sum function

3eca1f35

allow ranks to pass a list of files to be merged

a691b481

check that input dataset files exist

e4a34e2a

fix: using wrong doc_idx list for mmap

8b168cab

move init dist and collectives to distdata class

7a026938

thomasw21 commented on 2021-08-16

add --merge option, move parallel/serial to their own functions

eca2940f

Merge branch 'main' into pmerge

b14491df

Update megatron/data/distdata.py

ec11281f

Update megatron/data/indexed_dataset.py

354d13bd

thomasw21 assigned adammoody 4 years ago

thomasw21 added the enhancement label

Update megatron/data/indexed_dataset.py

2dc3f7ad

Update megatron/data/indexed_dataset.py

980e9043

Update megatron/data/indexed_dataset.py

ebd20a6f

Update megatron/data/indexed_dataset.py

69b2f49b

Update megatron/data/indexed_dataset.py

50de06ac

drop extraneous numpy tolist calls

af290ad9

rename self.MPI to mpi4py

4b58c74c

handle case where no ranks have elements in their file

71a2fdcf

rename tokenize_start to time_start

73d3a247

drop unrelated comment in distdata.min

b9e69bea

add comment why pointers_shift is not None and add assert

da615c6d

note why pointers uses sizes count and offset values

c42f41f5

can just rely on rank 0 for the leading 0 element

a3a7d539

add write_list function

163310aa

determine element size

01b2be07

add checks for consistent element_size values

4b6e8ffa

check that at least one rank has a file to merge

ea085555

adammoody changed the title ~~WIP: MPI-based parallel merge of per-rank Megatron data files~~ WIP: parallel merge of per-rank Megatron data files 4 years ago

assert that torch backend is gloo or mpi

2524fce6

add collectives for assert and raise

ca14d48d

rename to allassert and allraise_if

d482f36f

check dtype instead of element_size

28d76f57

add uint32 to element_sizes table

f706108a

infer dtype from files being merged

f1228837

add write_header function to indexed dataset classes

57c012e0

call write_header internally from IndexedDataset classes

eed83271

return number of bytes written from write calls

a75cfc2c

Merge branch 'main' into pmerge

afcfcf95

move scatterv to distdata class

74b733a4

add functions to format status and error messages

dadb51b4

defer merge_files_dist to future PR

a2f8fa0f

open files using with, refresh comments

39e6cd74

rely on default torch datatypes

2a29d996

fix some status messages from preprocess script

d6fa8959

fix: exclusive scan computing pointers list

1216c0ab

Merge branch 'pointerfix' into pmerge

a64d3dab

fix: exclusive scan to compute mmap pointers list

fde439ec
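
The two pointer fixes above refer to an exclusive scan: each element's byte offset is the sum of the byte sizes of all elements before it, so the first pointer is the starting offset rather than the first element's size. The following is a minimal sketch in plain Python, not the code from this PR. The function name `exclusive_scan_pointers`, its parameters, and the sample sizes are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def exclusive_scan_pointers(sizes, element_size, start=0):
    """Return the byte offset at which each element begins.

    sizes        -- item count of each element (hypothetical sample data)
    element_size -- bytes per item, e.g. 2 for uint16 tokens
    start        -- byte offset contributed by earlier data (assumed; 0 if none)
    """
    byte_counts = [n * element_size for n in sizes]
    # accumulate(..., initial=start) yields start, start+b0, start+b0+b1, ...
    # Dropping the final total leaves the exclusive scan.
    return list(accumulate(byte_counts, initial=start))[:-1]


if __name__ == "__main__":
    print(exclusive_scan_pointers([3, 5, 2], element_size=2))
```

In a distributed merge, `start` would plausibly come from a second exclusive scan over each rank's total byte count, so that every rank's pointers continue where the previous rank's data ends. That cross-rank step is an assumption here and is not shown.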

note about seek

ba14351e

rename preprocess_dataset_mpi.py to preprocess_data_dist.py

852fdd0c

update usage comments at top of script

61f4b467

restore commented print_rank_0 statements

22400f37

restore status message in mmap merge_file_

5cfcb955

drop mpi4py, sad :(

74c48831

Merge branch 'main' into pmerge

373e5145

add test case for parallel merge

78ab7158

add preprocess_data_dist test for serial merge

002b4032

adammoody changed the title ~~WIP: parallel merge of per-rank Megatron data files~~ parallel merge of per-rank Megatron data files 4 years ago

adammoody changed the title ~~parallel merge of per-rank Megatron data files~~ distributed merge of per-rank Megatron data files 4 years ago

thomasw21 commented on 2021-08-19

improve error handling

ba763f7c

refactor get_pointers code

fa111591

bug fix in exscan

7e53fd34

further refactor get_pointers

53df36f4

move exscan collective for pointers outside of try block

c43348ff

clarify some comments

81c21dd5

include string 1k in name of test files

adee502b

use temporary file for index

13ae421d

fix: implement scatterv from torch.distributed.scatter

f3e1b1dc

thomasw21 commented on 2021-08-23

thomasw21 commented on 2021-08-24

switch to pad method in torch.nn.functional

42962e1b

return data received in scatterv as new tensor

9a2f3838
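
The three scatterv commits above suggest a pad-then-trim approach: a fixed-size scatter needs equal-length chunks, so shorter chunks are padded before sending and the receiver slices off the padding into a new buffer. The sketch below simulates that idea with plain Python lists and makes no torch.distributed calls. All function names are hypothetical, and the actual implementation in this PR may differ.

```python
def pad_chunks(chunks, fill=0):
    """Pad variable-length chunks to the longest length (sender side)."""
    width = max((len(c) for c in chunks), default=0)
    return [list(c) + [fill] * (width - len(c)) for c in chunks]


def trim_received(padded_chunk, count):
    """Drop the padding on the receiving side and return a new list."""
    return list(padded_chunk[:count])


def simulated_scatterv(data, counts):
    """Split data by counts, pad, 'scatter', then trim.

    Returns the list each rank would end up holding. In a real run,
    rank r would receive padded[r] from the root via a fixed-size scatter.
    """
    chunks, offset = [], 0
    for n in counts:
        chunks.append(data[offset:offset + n])
        offset += n
    padded = pad_chunks(chunks)
    return [trim_received(padded[r], counts[r]) for r in range(len(counts))]


if __name__ == "__main__":
    print(simulated_scatterv([1, 2, 3, 4, 5, 6], counts=[3, 0, 1, 2]))
```

Ranks with a count of 0 still take part in the scatter and simply keep an empty result, which matches the earlier commits about ranks that have no items.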

raise exception if conflicting scratch and merge options

15b7603a

use allraise method from distdata in preprocess_data_dist

4adaddd7

thomasw21 commented on 2021-08-25

thomasw21 merged 97221116 into main 4 years ago

adammoody deleted the pmerge branch 4 years ago

Megatron-DeepSpeed
distributed merge of per-rank Megatron data files
#55

Merged
