Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491)
* Using vectorized loads (float2) for fp16 to improve performance
* Fix a few warnings from cpplint
* Use __float2half2_rn and fix some cpplint warnings
* Move some computations to LaunchFastGeluKernel
* Fix some C++ lint warnings
* Using vectorized loads (float4) for fp16 to improve performance
* Add a switch for whether to optimize FastGelu with float4 vectorization
* Switch to float4 memory access based on input_length in FastGelu
* Comment how to set the threshold of float2 and float4 vectorized kernels
* Add FastGelu fp16 unit tests for bias_length = 2 and 8
* Make vectorized kernels generic with aligned_vector
* Unify the vectorized kernels with/without bias
* Refactor the code to suppress cpplint warnings
* Solve formatting issues
* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel
* Move fast_gelu_impl.h to rocm/bert
* Fix some C++ lint warnings and code alignment