Megatron-DeepSpeed
use HuggingFace Datasets as source to build Megatron data files
#48
Merged

stas00 merged 30 commits into bigscience-workshop:main from adammoody:hfdset
adammoody indexed_dataset: use numpy to compute byte offsets faster
f68999f7
adammoody preprocess with huggingface datasets and mpi
d5d20bbc
adammoody preprocess_dataset_mpi: add --shuffle and --seed options
32fc48fe
adammoody indexed_dataset: fix to handle file with 0 items
a456e484
adammoody preprocess_dataset_mpi: add --split and --count options
17ac2f97
adammoody update script comments to reflect shuffle behavior
7836a327
adammoody add torch.distributed version
7b27853d
huu4ontocord commented on 2021-08-06
thomasw21 commented on 2021-08-06
adammoody Update tools/preprocess_dataset_mpi.py
92d78c4f
adammoody Update tools/preprocess_dataset_mpi.py
38b2d8ad
adammoody Update tools/preprocess_dataset_mpi.py
6264d7a6
adammoody Update tools/preprocess_dataset_mpi.py
782151f8
adammoody Update tools/preprocess_dataset_mpi.py
88f5d0bd
adammoody Update tools/preprocess_dataset_mpi.py
31fab0e8
adammoody Update tools/preprocess_dataset_mpi.py
520d06ce
adammoody add estimated progress logging
6e0e4fd0
adammoody avoid downloading dataset unless user really wants to
03bf1997
thomasw21 approved these changes on 2021-08-07
adammoody Update tools/preprocess_dataset_mpi.py
0b2d4cdd
adammoody Update tools/preprocess_dataset_mpi.py
cbf965ef
adammoody refactor main into more functions
7c8c1c92
adammoody commented on 2021-08-08
thomasw21 commented on 2021-08-07
adammoody reformat progress messages
600c0911
adammoody move mpi4py import test to get_args
6bb27f7e
adammoody drop Open MPI variables from init_process_group
7ee7bf5a
adammoody add --local_rank to support torch.distributed.launch
f0f45b96
adammoody update from DeepSpeedExamples
3be3423f
adammoody raise exceptions on errors
8ae0cf84
adammoody drop --download option
a8e9b2e4
adammoody format byte rate as MB/s
fa9e3236
thomasw21 approved these changes on 2021-08-10
adammoody Update tools/preprocess_dataset_mpi.py
3db9cdb9
adammoody move datasets import back to top
764e760d
adammoody import config from datasets
80ee2308
stas00 merged 5069622a into main 4 years ago
adammoody deleted the hfdset branch 4 years ago

