Megatron-DeepSpeed
4c2aea04 - Fix throughput unit (#241)

Commit
3 years ago
Fix throughput unit (#241)

* Add scripts for fixing up outdated log files

  - `rescale-logs.py` fixes up textual logs, such as Slurm output files. It adjusts and fixes time values to be in seconds.
  - `tb-rescale-scalars.py` allows scaling scalar values in TensorBoard files. By default, it fixes throughput units.
  - `tb-rename-events.py` allows updating names of log events in TensorBoard files.

* Fix throughput unit

  Was in milliseconds, is now in seconds. Variable names and log strings indicate the unit is supposed to be seconds.

  - To fix up old TensorBoard log files, see `/tools/logs/tb-rescale-scalars.py`.
  - To fix up old "print-output" log files, see `/tools/logs/rescale-logs.py`.

  Fix #236.

* Adjust unit of "elapsed time per iteration" log

  Now corresponds to other logged time values which are also measured in seconds. Increase print precision accordingly (not equivalently) in order to handle low values better.

  To fix up old "print-output" log files, see `/tools/logs/rescale-logs.py`.
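For illustration only, the kind of fix-up described for textual logs might look like the sketch below. It is not the code of `rescale-logs.py`: the log-line format, the `rescale_line` helper, and the chosen print precision are all assumptions made for this example.

```python
import re

# Assumed (hypothetical) format of an old log entry, e.g.
#   "elapsed time per iteration (ms): 1234.5"
# The format actually handled by the repository's script may differ.
_OLD_ENTRY = re.compile(r"(elapsed time per iteration) \(ms\): (\d+(?:\.\d+)?)")


def rescale_line(line: str, precision: int = 2) -> str:
    """Rewrite a millisecond time entry in `line` as seconds.

    Lines without a matching entry are returned unchanged.
    """

    def _to_seconds(match: re.Match) -> str:
        seconds = float(match.group(2)) / 1000.0  # ms -> s
        return f"{match.group(1)} (s): {seconds:.{precision}f}"

    return _OLD_ENTRY.sub(_to_seconds, line)


if __name__ == "__main__":
    old = "iteration 10 | elapsed time per iteration (ms): 1234.5 | lr: 1e-4"
    print(rescale_line(old))
```

Printing more decimal places in seconds than the millisecond output used is what keeps small per-iteration times from rounding to zero, which is the concern behind the precision change noted in the commit message.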