use HuggingFace Datasets as source to build Megatron data files #48
indexed_dataset: use numpy to compute byte offsets faster
f68999f7
preprocess with huggingface datasets and mpi
d5d20bbc
preprocess_dataset_mpi: add --shuffle and --seed options
32fc48fe
indexed_dataset: fix to handle file with 0 items
a456e484
preprocess_dataset_mpi: add --split and --count options
17ac2f97
update script comments to reflect shuffle behavior
7836a327
add torch.distributed version
7b27853d
Update tools/preprocess_dataset_mpi.py
92d78c4f
Update tools/preprocess_dataset_mpi.py
38b2d8ad
Update tools/preprocess_dataset_mpi.py
6264d7a6
Update tools/preprocess_dataset_mpi.py
782151f8
Update tools/preprocess_dataset_mpi.py
88f5d0bd
Update tools/preprocess_dataset_mpi.py
31fab0e8
Update tools/preprocess_dataset_mpi.py
520d06ce
add estimated progress logging
6e0e4fd0
avoid downloading dataset unless user really wants to
03bf1997
thomasw21 approved these changes on 2021-08-07
Update tools/preprocess_dataset_mpi.py
0b2d4cdd
Update tools/preprocess_dataset_mpi.py
cbf965ef
refactor main into more functions
7c8c1c92
reformat progress messages
600c0911
move mpi4py import test to get_args
6bb27f7e
drop Open MPI variables from init_process_group
7ee7bf5a
add --local_rank to support torch.distributed.launch
f0f45b96
update from DeepSpeedExamples
3be3423f
raise exceptions on errors
8ae0cf84
drop --download option
a8e9b2e4
format byte rate as MB/s
fa9e3236
thomasw21 approved these changes on 2021-08-10
Update tools/preprocess_dataset_mpi.py
3db9cdb9
move datasets import back to top
764e760d
import config from datasets
80ee2308
stas00 merged 5069622a into main 4 years ago
adammoody deleted the hfdset branch 4 years ago