[DRAFT] Tentative implementation of MiCS (#2964)
* include mics config and optimizer
* change private vars to public vars
so the child class can initialize these vars
* Port the init function from stage3
* adding a model test file for mics
* adapt to get_accelerator API and fp16 group defrag
* WIP: porting mics modification to ms master
* WIP: included gradient all-reduce among replication groups
* WIP: ported hierarchical all-gather part
did basic loss test on a simple MLP model
* [Bug fix] use the comm group attached to the param
* torch 2.0 support
* remove print
* delegate wait op
* [Bug] fix naming
* adding doc string
* resolving recursive import
* fix formatting, typos, and license
* fix license and unit test error
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-14-191.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-7-70.us-west-2.compute.internal>
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
Co-authored-by: zhzhn <zhzhn@ip-10-2-57-114.us-west-2.compute.internal>