WIP: distributed terashuf #92
269af4eb  add parallel merge using mpi
9ba081be  handle case where some ranks might have 0 items
d29a7023  add inclusive scan prefix sum
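The prefix-sum commit above implies each rank needs to know where its items land in the merged output. The PR's code is not shown here, so this is only a minimal single-process sketch with numpy (the `counts` array and all variable names are hypothetical): an inclusive scan over per-rank counts gives each rank its end offset, and subtracting its own count gives its start, which also handles the zero-item ranks mentioned in the commit before it.

```python
import numpy as np

# Hypothetical per-rank element counts gathered from 4 ranks;
# rank 1 has 0 items, the edge case the earlier commit handles.
counts = np.array([3, 0, 5, 2])

# Inclusive scan (prefix sum): entry i is the total count on ranks 0..i,
# so rank i's data ends at incl[i] in the merged output.
incl = np.cumsum(counts)

# Each rank's start offset is its end offset minus its own count.
starts = incl - counts
```

With MPI this would be one collective scan call per rank rather than a cumsum over a gathered array, but the arithmetic is the same.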
ed497132  report more timing info
e94f2a0c  Update megatron/data/indexed_dataset.py
687ff32f  Update megatron/data/indexed_dataset.py
af595454  rename total size variable for clarity
4f648a0a  move translation to bin/idx file names a level deeper
9f2ba6ae  parallel merge for cached dataset
72d6c9c2  add alltrue function
8b67becb  move collectives to new distdata class, add torch.distributed
3eca1f35  drop unused prefix_sum function
a691b481  allow ranks to pass a list of files to be merged
e4a34e2a  check that input dataset files exist
8b168cab  fix: using wrong doc_idx list for mmap
7a026938  move init dist and collectives to distdata class
eca2940f  add --merge option, move parallel/serial to their own functions
b14491df  Merge branch 'main' into pmerge
ec11281f  Update megatron/data/distdata.py
354d13bd  Update megatron/data/indexed_dataset.py
2dc3f7ad  Update megatron/data/indexed_dataset.py
980e9043  Update megatron/data/indexed_dataset.py
ebd20a6f  Update megatron/data/indexed_dataset.py
69b2f49b  Update megatron/data/indexed_dataset.py
50de06ac  Update megatron/data/indexed_dataset.py
af290ad9  drop extraneous numpy tolist calls
4b58c74c  rename self.MPI to mpi4py
71a2fdcf  handle case where no ranks have elements in their file
73d3a247  rename tokenize_start to time_start
b9e69bea  drop unrelated comment in distdata.min
da615c6d  add comment why pointers_shift is not None and add assert
c42f41f5  note why pointers uses sizes count and offset values
a3a7d539  can just rely on rank 0 for the leading 0 element
163310aa  add write_list function
01b2be07  determine element size
4b6e8ffa  add checks for consistent element_size values
ea085555  check that at least one rank has a file to merge
2524fce6  assert that torch backend is gloo or mpi
ca14d48d  add collectives for assert and raise
d482f36f  rename to allassert and allraise_if
28d76f57  check dtype instead of element_size
f706108a  add uint32 to element_sizes table
f1228837  infer dtype from files being merged
57c012e0  add write_header function to indexed dataset classes
eed83271  call write_header internally from IndexedDataset classes
a75cfc2c  return number of bytes written from write calls
afcfcf95  Merge branch 'main' into pmerge
74b733a4  move scatterv to distdata class
dadb51b4  add functions to format status and error messages
a2f8fa0f  defer merge_files_dist to future PR
39e6cd74  open files using with, refresh comments
2a29d996  rely on default torch datatypes
d6fa8959  fix some status messages from preprocess script
1216c0ab  fix: exclusive scan computing pointers list
a64d3dab  Merge branch 'pointerfix' into pmerge
fde439ec  fix: exclusive scan to compute mmap pointers list
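The two pointer-fix commits swap an inclusive scan for an exclusive one when building the mmap pointers list: document i must start at the sum of the sizes *before* it, not including its own. The actual fix is not reproduced in this page, so the following is a sketch under that reading, with a hypothetical `sizes` array.

```python
import numpy as np

# Hypothetical per-document sizes in bytes for one rank's slice.
sizes = np.array([4, 7, 2, 9], dtype=np.int64)

# Exclusive scan: pointers[0] is 0 and pointers[i] is the sum of
# sizes[0..i-1], so each document starts where the previous one ends.
pointers = np.concatenate(([0], np.cumsum(sizes)[:-1]))
```

An inclusive scan here would shift every pointer one document too far, which is the kind of off-by-one the fix commits describe.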
fb274bfe  abstraction to index and randomly access jsonl files
d428c025  rebase on parallel merge, replace mpi4py with distdata class
ba14351e  note about seek
852fdd0c  rename preprocess_dataset_mpi.py to preprocess_data_dist.py
61f4b467  update usage comments at top of script
18881ae0  Merge branch 'pmerge' into mpijson
bd6f41fb  look for extension .jsonl
3488d0bc  add progress messages
1305fe93  rebuild index if mtime is old
6bcac1fd  store index values in network byte order
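Storing index values in network byte order makes the index file portable across hosts with different native endianness. The PR's format is not shown here, so this sketch only illustrates the technique with Python's `struct` module; the magic bytes, version number, and field layout are all hypothetical.

```python
import struct

MAGIC = b"IDXJ"   # hypothetical 4-byte magic value identifying the file
VERSION = 1       # hypothetical format version number

# "!" selects network (big-endian) byte order, so the bytes written are
# identical no matter which machine produced the index.
header = struct.pack("!4sI", MAGIC, VERSION)

# A 64-bit record offset packed the same way.
offset = struct.pack("!Q", 123456789)
```

A reader unpacks with the same `"!"` format strings, so a check of the magic value and version can reject stale or foreign index files before trusting the offsets.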
813d0683  add magic value and format version number to index file
0510081b  Merge branch 'main' into mpijson
1fea302d  clean up merge
d3603130  clean up merge
20a43afe  pass distctx instead of mpi_comm to IndexedJSON
7b083479  move existence test and stat queries to distdata
8d448bce  add exception handling
6f7519f0  edit typos in comments
3f9078d4  close shared file if open fails on any rank
fbd38bfc  add distributed shuffle
927fbc17  shuffle on each rank to keep rng in step
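"Shuffle on each rank to keep rng in step" suggests every rank seeds an identical generator and draws the same permutation, so ranks agree on the global order without communicating. The PR's implementation is not visible here; this is a minimal sketch of that idea, and the function name `shared_permutation` is hypothetical.

```python
import numpy as np

def shared_permutation(total, seed):
    """Every rank calls this with the same total and seed, so all ranks
    draw an identical permutation without any communication."""
    rng = np.random.default_rng(seed)
    return rng.permutation(total)

# Each rank computes the shared order, then keeps only its own slice.
perm = shared_permutation(10, seed=42)
```

Because the generator state advances identically on every rank, any later draws from the same `rng` would also stay in step, which is what the commit message refers to.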
d243847f  optimize broadcast and sample ident steps with numpy
6d7dd4b8  add timer for global shuffle step
6bc9b943  generate random seed if not specified
0b4a4cad  add function to concatenate files