Add megatron transforms for BART (#5521)
* Large model export and run ORT Python support
* Megatron change
* Refine a bit
* Work around self-attention issue
* Use partitioned name for weights when Megatron model parallel is enabled
* Fix Megatron transformer issue (caused by the renaming)
* Add UTs for T5 model parallel
* Fix Megatron seed issue
* Fix logging a bit
* Checkpointing changes + rebase
* Revert unintended reshape transform change
* T5 layer norm changes
* Add T5 layer norm kernel
* Use template for T5 layer norm
* Template definition changes
* Fix build errors
* Add CPU and CUDA kernels
* First unit test
* Other forward unit tests
* Add T5LayerNormGrad
* Add C++ transform and test for T5 LN
* Minor fix
* BART MLP Megatron transform
* Add concat-slice transform + test
* Cosmetic improvements in concat-slice transform
* Constant folding bug fix + Megatron attention transform for BART
* Undo unnecessary changes
* Cleanup
* Remove unnecessary changes
* Cleanup megatron
* Windows build
* Add self attention test graph
* Correcting transforms + cleanup
* Address review comments
* Address more review comments
* Fix build and test failures
* Fix CI
* Fix Windows CI
Co-authored-by: Peng Wang <pengwa@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>