use HuggingFace Datasets as source to build Megatron data files (#48)
* indexed_dataset: use numpy to compute byte offsets faster
* preprocess with HuggingFace Datasets and MPI
* preprocess_dataset_mpi: add --shuffle and --seed options
* indexed_dataset: fix to handle file with 0 items
* preprocess_dataset_mpi: add --split and --count options
* update script comments to reflect shuffle behavior
* add torch.distributed version
* Update tools/preprocess_dataset_mpi.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* add estimated progress logging
* avoid downloading dataset unless user really wants to
* refactor main into more functions
* reformat progress messages
* move mpi4py import test to get_args
* drop Open MPI variables from init_process_group
* add --local_rank to support torch.distributed.launch
* update from DeepSpeedExamples
* raise exceptions on errors
* drop --download option
* format byte rate as MB/s
* move datasets import back to top
* import config from datasets
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
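
Note on the indexed_dataset changes: the sketch below illustrates the idea behind "use numpy to compute byte offsets faster" and "fix to handle file with 0 items". It is an assumption about the general technique, not the code from this change. The function name `byte_offsets` and its parameters are hypothetical. Each item's byte offset is the running sum of the byte sizes of the items before it, so a vectorized cumulative sum can replace a Python loop, and an empty input must yield an empty offset array.

```python
import numpy as np


def byte_offsets(sizes, itemsize):
    """Return the starting byte offset of each item (hypothetical helper).

    sizes    -- number of elements in each item
    itemsize -- bytes per element (e.g. 2 for int16 token ids)
    """
    sizes = np.asarray(sizes, dtype=np.int64)
    offsets = np.zeros(len(sizes), dtype=np.int64)
    # The first item starts at byte 0; every later item starts where the
    # previous one ends. Skip the cumsum for 0 or 1 items, which also
    # covers the empty-file case.
    if len(sizes) > 1:
        np.cumsum(sizes[:-1] * itemsize, out=offsets[1:])
    return offsets
```

For example, three items of 3, 2, and 4 elements at 2 bytes per element start at bytes 0, 6, and 10.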