Add MPICH Multinode Runner (#2839)
* MPICH support
* MPICH changes
* accelerator runtime modifications
* Accelerator runtime changes
* Accelerator runtime modifications
* Remove redundant print from single node
* Move hostfile to tmp
* Code cleanup for MPICH class
* Code cleanup, rm whitespace
* Removing mpiexec environment check details
* Tmp hostfile not needed, as the hostfile is passed directly
* Remove debugging comments
* rm print statement
* Revert comm changes as workaround (WA) not needed
* Use MPICHRunner name for class
* Use MPICHRunner as class name
* No need to use args.force_multi and args.launcher.
This should be set in the DeepSpeedExamples gpt-3.6b.sh script as:
launcher=MPICH
run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
* Adhere to code pattern
* Rm empty lines in MPICHRunner class
* Uncomment check for num nodes and workers when hostfile_deepspeed is used in gpt-3.6b.sh
* Pass MPICH hostfile through launcher_args in gpt-3.6b.sh
* Clean code and remove args hostfile
* fix merge
* fix merge
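
The gpt-3.6b.sh setup described in the items above could look roughly like the sketch below. It is illustrative only: the hostfile paths, node and GPU counts, and the `hostfile_mpich` variable are hypothetical, and the `-f` option inside `--launcher_args` assumes MPICH's Hydra `mpiexec` host file flag. The command is only built and printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration; none are taken from the real script.
hostfile_ds=/tmp/hostfile_deepspeed   # DeepSpeed-style hostfile (assumed path)
hostfile_mpich=/tmp/hostfile_mpich    # MPICH-style hostfile (assumed path)
NUM_WORKERS=2
NUM_GPUS_PER_WORKER=4
gpt_options=""                        # model options left empty in this sketch

# Select the MPICH multinode runner.
launcher=MPICH

# Build the launch command. The MPICH hostfile goes through --launcher_args;
# "-f" is assumed to be the host file option of MPICH's mpiexec.
run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --launcher_args \"-f ${hostfile_mpich}\" --force_multi pretrain_gpt2.py $* ${gpt_options}"

# Print instead of running, so the sketch works without a cluster.
echo "${run_cmd}"
```

In an actual script the final line would be `eval "${run_cmd}"` or similar; printing first is a cheap way to check the assembled flags.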
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* clean up and fix format
* add unit test (UT)
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>