WIP: distributed terashuf #92
269af4eb  add parallel merge using mpi
9ba081be  handle case where some ranks might have 0 items
d29a7023  add inclusive scan prefix sum
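The prefix-sum commit above implies each rank needs to know where its items land in the merged output. The PR's code is not shown here, so this is only a minimal single-process sketch with numpy (the `counts` array and all variable names are hypothetical): an inclusive scan over per-rank counts gives each rank its end offset, and subtracting its own count gives its start, which also handles the zero-item ranks mentioned in the commit before it.

```python
import numpy as np

# Hypothetical per-rank element counts gathered from 4 ranks;
# rank 1 has 0 items, the edge case the earlier commit handles.
counts = np.array([3, 0, 5, 2])

# Inclusive scan (prefix sum): entry i is the total count on ranks 0..i,
# so rank i's data ends at incl[i] in the merged output.
incl = np.cumsum(counts)

# Each rank's start offset is its end offset minus its own count.
starts = incl - counts
```

With MPI this would be one collective scan call per rank rather than a cumsum over a gathered array, but the arithmetic is the same.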
ed497132  report more timing info
e94f2a0c  Update megatron/data/indexed_dataset.py
687ff32f  Update megatron/data/indexed_dataset.py
af595454  rename total size variable for clarity
4f648a0a  move translation to bin/idx file names a level deeper
9f2ba6ae  parallel merge for cached dataset
72d6c9c2  add alltrue function
8b67becb  move collectives to new distdata class, add torch.distributed
3eca1f35  drop unused prefix_sum function
a691b481  allow ranks to pass a list of files to be merged
e4a34e2a  check that input dataset files exist
8b168cab  fix: using wrong doc_idx list for mmap
7a026938  move init dist and collectives to distdata class
eca2940f  add --merge option, move parallel/serial to their own functions
b14491df  Merge branch 'main' into pmerge
ec11281f  Update megatron/data/distdata.py
354d13bd  Update megatron/data/indexed_dataset.py
2dc3f7ad  Update megatron/data/indexed_dataset.py
980e9043  Update megatron/data/indexed_dataset.py
ebd20a6f  Update megatron/data/indexed_dataset.py
69b2f49b  Update megatron/data/indexed_dataset.py
50de06ac  Update megatron/data/indexed_dataset.py
af290ad9  drop extraneous numpy tolist calls
4b58c74c  rename self.MPI to mpi4py
71a2fdcf  handle case where no ranks have elements in their file
73d3a247  rename tokenize_start to time_start
b9e69bea  drop unrelated comment in distdata.min
da615c6d  add comment why pointers_shift is not None and add assert
c42f41f5  note why pointers uses sizes count and offset values
a3a7d539  can just rely on rank 0 for the leading 0 element
163310aa  add write_list function
01b2be07  determine element size
4b6e8ffa  add checks for consistent element_size values
ea085555  check that at least one rank has a file to merge
2524fce6  assert that torch backend is gloo or mpi
ca14d48d  add collectives for assert and raise
d482f36f  rename to allassert and allraise_if
28d76f57  check dtype instead of element_size
f706108a  add uint32 to element_sizes table
f1228837  infer dtype from files being merged
57c012e0  add write_header function to indexed dataset classes
eed83271  call write_header internally from IndexedDataset classes
a75cfc2c  return number of bytes written from write calls
afcfcf95  Merge branch 'main' into pmerge
74b733a4  move scatterv to distdata class
dadb51b4  add functions to format status and error messages
a2f8fa0f  defer merge_files_dist to future PR
39e6cd74  open files using with, refresh comments
2a29d996  rely on default torch datatypes
d6fa8959  fix some status messages from preprocess script
1216c0ab  fix: exclusive scan computing pointers list
a64d3dab  Merge branch 'pointerfix' into pmerge
fde439ec  fix: exclusive scan to compute mmap pointers list
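The two pointer-fix commits swap an inclusive scan for an exclusive one when building the mmap pointers list: document i must start at the sum of the sizes *before* it, not including its own. The actual fix is not reproduced in this page, so the following is a sketch under that reading, with a hypothetical `sizes` array.

```python
import numpy as np

# Hypothetical per-document sizes in bytes for one rank's slice.
sizes = np.array([4, 7, 2, 9], dtype=np.int64)

# Exclusive scan: pointers[0] is 0 and pointers[i] is the sum of
# sizes[0..i-1], so each document starts where the previous one ends.
pointers = np.concatenate(([0], np.cumsum(sizes)[:-1]))
```

An inclusive scan here would shift every pointer one document too far, which is the kind of off-by-one the fix commits describe.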
fb274bfe  abstraction to index and randomly access jsonl files
d428c025  rebase on parallel merge, replace mpi4py with distdata class
ba14351e  note about seek
852fdd0c  rename preprocess_dataset_mpi.py to preprocess_data_dist.py
61f4b467  update usage comments at top of script
18881ae0  Merge branch 'pmerge' into mpijson
bd6f41fb  look for extension .jsonl
3488d0bc  add progress messages
1305fe93  rebuild index if mtime is old
6bcac1fd  store index values in network byte order
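Storing index values in network byte order makes the index file portable across hosts with different native endianness. The PR's format is not shown here, so this sketch only illustrates the technique with Python's `struct` module; the magic bytes, version number, and field layout are all hypothetical.

```python
import struct

MAGIC = b"IDXJ"   # hypothetical 4-byte magic value identifying the file
VERSION = 1       # hypothetical format version number

# "!" selects network (big-endian) byte order, so the bytes written are
# identical no matter which machine produced the index.
header = struct.pack("!4sI", MAGIC, VERSION)

# A 64-bit record offset packed the same way.
offset = struct.pack("!Q", 123456789)
```

A reader unpacks with the same `"!"` format strings, so a check of the magic value and version can reject stale or foreign index files before trusting the offsets.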
813d0683  add magic value and format version number to index file
0510081b  Merge branch 'main' into mpijson
1fea302d  clean up merge
d3603130  clean up merge
20a43afe  pass distctx instead of mpi_comm to IndexedJSON
7b083479  move existence test and stat queries to distdata
8d448bce  add exception handling
6f7519f0  edit typos in comments
3f9078d4  close shared file if open fails on any rank
fbd38bfc  add distributed shuffle
927fbc17  shuffle on each rank to keep rng in step
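"Shuffle on each rank to keep rng in step" suggests every rank seeds an identical generator and draws the same permutation, so ranks agree on the global order without communicating. The PR's implementation is not visible here; this is a minimal sketch of that idea, and the function name `shared_permutation` is hypothetical.

```python
import numpy as np

def shared_permutation(total, seed):
    """Every rank calls this with the same total and seed, so all ranks
    draw an identical permutation without any communication."""
    rng = np.random.default_rng(seed)
    return rng.permutation(total)

# Each rank computes the shared order, then keeps only its own slice.
perm = shared_permutation(10, seed=42)
```

Because the generator state advances identically on every rank, any later draws from the same `rng` would also stay in step, which is what the commit message refers to.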
d243847f  optimize broadcast and sample ident steps with numpy
6d7dd4b8  add timer for global shuffle step
6bc9b943  generate random seed if not specified
0b4a4cad  add function to concatenate files