Simplified layer norm changes (#5028)
* t5 layer norm changes
* add t5 layer norm kernel
* use template for t5 layer norm
* template definition changes
* fix build errors
* add CPU cuda kernel
* first unit test
* other forward unit tests
* add T5LayerNormGrad
* add C++ transform and test for T5 LN
* fixes and some debug prints
* fix CUDA error
* rename from t5 to simplified
* address PR comments
* revert change on invertible LN code path
* remove duplicate forward computation
* add GradientCheckerTest.SimplifiedLayerNormGrad
* revert macro change
* fix SimplifiedLayerNorm gradient
* merge with Sherlock's changes
* change CUDA kernel
* reapply CPU kernel changes
Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>