Floating-point ops counting and reloading (#40)
* initial FLO count/logging setup (need to fix model parameter count)
* 1B3 parameter setup + FLOs counting
* 1B3 parameter setup
* synced with latest 13B script
* pipe transformer docstring
* improve DS integration evaluation + logging
* use pp engine even for pp=1 (#6)
* removed slurm_examples
* FLOs reloading
* Update megatron/training.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update megatron/data/gpt_dataset.py
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
* Update megatron/utils.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update megatron/utils.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* formatting fix, deferring the bug to a separate fix; added FLO logging to TensorBoard groups
* fixed indentation bug
* fixing possible double counts
* tweaks
* warning for double counts
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: TevenLeScao <uhk85as@jean-zay1.idris.fr>
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>