Add hipified SkipLayerNorm code for ROCmEP (#12107)
* First attempt for half2 vectorized memory access in SkipLayerNorm
* Add some functions for debugging
* Clean up the code
* Generalize the vectorized kernels with aligned_vector and remove cudaDeviceProp
* Add a unit test for a larger input size
* Fix some Lint C++ warnings
* Use ILP = 4 for the vectorized kernels
* Rewrite the vectorized kernel and templatize ComputeSkipLayerNorm
* Use conditional operator for input_v
* Refactor LaunchSkipLayerNormKernel and replace the original SkipLayerNormKernelSmall with the vectorized kernel
* Clean up some comments and rename the layernorm function
* Use ComputeSkipLayerNorm to replace LaunchSkipLayerNormKernel
* Resolve a Lint C++ warning
* Fix SkipLayerNormBatch1_Float16_vec output data
* Add hipified code of bert SkipLayerNorm for ROCmEP
* Resolve some Lint C++ warnings
* Resolve Python formatting issue
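
For readers unfamiliar with the vectorized-access idea mentioned in the list above (aligned_vector, ILP = 4), here is a minimal host-side C++ sketch. It is not the kernel from this change: `AlignedVector`, `LoadVec`, `SkipLayerNormRow`, and `kILP` are hypothetical names chosen for illustration, the bias term is omitted, and the code runs on the CPU over a single row. It only shows the pattern of loading four elements per access and then normalizing `input + skip`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an aligned_vector-style helper: N elements of T
// packed into one struct whose alignment matches its size, so a GPU kernel
// could move it with a single wide load or store.
template <typename T, int N>
struct alignas(sizeof(T) * N) AlignedVector {
  T val[N];
};

constexpr int kILP = 4;  // elements handled per vectorized access (assumption)

// Load kILP consecutive elements in one step. A device kernel would typically
// reinterpret_cast an aligned pointer; memcpy keeps this host sketch
// well-defined regardless of the pointer's alignment.
template <typename T>
inline AlignedVector<T, kILP> LoadVec(const T* p) {
  AlignedVector<T, kILP> v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// One row of out = gamma * normalize(input + skip) + beta.
// All vectors must have the same length, and it must be a multiple of kILP.
inline std::vector<float> SkipLayerNormRow(const std::vector<float>& input,
                                           const std::vector<float>& skip,
                                           const std::vector<float>& gamma,
                                           const std::vector<float>& beta,
                                           float epsilon) {
  const std::size_t n = input.size();
  assert(n > 0 && n % kILP == 0);
  assert(skip.size() == n && gamma.size() == n && beta.size() == n);

  std::vector<float> sum(n);
  float total = 0.0f;
  float total_sq = 0.0f;
  for (std::size_t i = 0; i < n; i += kILP) {
    const AlignedVector<float, kILP> in_v = LoadVec(&input[i]);
    const AlignedVector<float, kILP> skip_v = LoadVec(&skip[i]);
    for (int k = 0; k < kILP; ++k) {
      const float x = in_v.val[k] + skip_v.val[k];
      sum[i + k] = x;
      total += x;
      total_sq += x * x;
    }
  }

  const float mean = total / static_cast<float>(n);
  const float variance = total_sq / static_cast<float>(n) - mean * mean;
  const float rstd = 1.0f / std::sqrt(variance + epsilon);

  std::vector<float> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (sum[i] - mean) * rstd * gamma[i] + beta[i];
  }
  return out;
}
```

In a real CUDA or HIP kernel the same struct would be used for half-precision data as well, and the reduction over the row would be done cooperatively by the threads of a block rather than in a serial loop.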