Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491)
* Using vectorized loads (float2) for fp16 to improve performance
* Fix a few warnings from cpplint
* Use __float2half2_rn and fix some cpplint warnings
* Move some computations to LaunchFastGeluKernel
* Fix some C++ lint warnings
* Using vectorized loads (float4) for fp16 to improve performance
* Add a switch for whether to optimize FastGelu with float4 vectorization
* Switch to float4 memory access based on input_length in FastGelu
* Comment how to set the threshold of float2 and float4 vectorized kernels
* Add FastGelu fp16 unit tests for bias_length = 2 and 8
* Make vectorized kernels generic with aligned_vector
* Unify the vectorized kernels with/without bias
* Refactor the code to suppress cpplint warnings
* Solve formatting issues
* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel
* Move fast_gelu_impl.h to rocm/bert
* Fix some C++ lint warnings and code alignment