Megatron-DeepSpeed
5069622a - use HuggingFace Datasets as source to build Megatron data files (#48)

Commit

4 years ago

use HuggingFace Datasets as source to build Megatron data files (#48)

* indexed_dataset: use numpy to compute byte offsets faster
* preprocess with huggingface datasets and mpi
* preprocess_dataset_mpi: add --shuffle and --seed options
* indexed_dataset: fix to handle file with 0 items
* preprocess_dataset_mpi: add --split and --count options
* update script comments to reflect shuffle behavior
* add torch.distributed version
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* add estimated progress logging
* avoid downloading dataset unless user really wants to
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* refactor main into more functions
* reformat progress messages
* move mpi4py import test to get_args
* drop Open MPI variables from init_process_group
* add --local_rank to support torch.distributed.launch
* update from DeepSpeedExamples
* raise exceptions on errors
* drop --download option
* format byte rate as MB/s
* Update tools/preprocess_dataset_mpi.py
  Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* move datasets import back to top
* import config from datasets

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
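
The first and fourth items in the message (computing byte offsets with numpy, and handling a file with 0 items) can be illustrated with a minimal sketch. This is not the code from the commit: the function name byte_offsets and its parameters are hypothetical, and it assumes each item is described by an element count and a fixed element size in bytes.

```python
import numpy as np


def byte_offsets(sizes, itemsize):
    """Return the starting byte offset of each item.

    Hypothetical helper, not the function from the commit.
    sizes: number of elements in each item.
    itemsize: bytes per element (for example 4 for int32 tokens).
    """
    sizes = np.asarray(sizes, dtype=np.int64)
    # A file with 0 items has no offsets; return early so that
    # the indexing below never touches an empty array.
    if sizes.size == 0:
        return np.zeros(0, dtype=np.int64)
    offsets = np.empty_like(sizes)
    offsets[0] = 0
    # Exclusive prefix sum: item i starts where items 0..i-1 end.
    np.cumsum(sizes[:-1] * itemsize, out=offsets[1:])
    return offsets


print(byte_offsets([3, 5, 2], 4).tolist())  # [0, 12, 32]
print(byte_offsets([], 4).tolist())         # []
```

The vectorized cumulative sum replaces a Python-level loop over items, which is where the speedup would plausibly come from on datasets with many items.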

References

#48 - use HuggingFace Datasets as source to build Megatron data files

Author

adammoody

Parents

3c9d748b
