onnxruntime
6651d2f6 - Make elementwise op run 4 items per thread (#2335)

Commit
6 years ago
Make elementwise op run 4 items per thread (#2335)

Description:
- Make elementwise ops process 4 items per thread.
- Unroll the for loop to leverage instruction-level parallelism (ILP).
- Remove the unnecessary N==0 check inside the elementwise GPU kernel.

Motivation and Context:
This improves the performance of GPU elementwise ops, with a ~2% performance gain on the popular NLP BERT model.
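The indexing pattern described above can be illustrated with a minimal sketch. This is not the code from the commit: it is a plain C++ host-side emulation of a GPU launch, and every name in it (`kThreadsPerBlock`, `kItemsPerThread`, `ElementwiseThreadBody`, `LaunchElementwise`) is hypothetical. It assumes each thread handles 4 elements spaced one block width apart, which on a GPU would keep neighbouring threads on neighbouring addresses, and it assumes the empty-input case is handled by the caller rather than inside the per-thread body.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical launch parameters for this sketch (not taken from the commit).
constexpr int kThreadsPerBlock = 256;
constexpr int kItemsPerThread = 4;

// Emulates the work one GPU thread would do. On a GPU, block_idx and
// thread_idx would come from blockIdx.x and threadIdx.x.
template <typename T, typename Func>
void ElementwiseThreadBody(int block_idx, int thread_idx, const T* input,
                           T* output, Func op, std::ptrdiff_t n) {
  // First element owned by this thread; later ones are one block width apart.
  const std::ptrdiff_t start =
      static_cast<std::ptrdiff_t>(kItemsPerThread) * kThreadsPerBlock * block_idx +
      thread_idx;

  T values[kItemsPerThread];

  // Phase 1: issue all loads. The trip count is a compile-time constant, so
  // the loop can be fully unrolled and the independent loads can overlap.
  std::ptrdiff_t id = start;
  for (int i = 0; i < kItemsPerThread; ++i) {
    if (id < n) values[i] = input[id];
    id += kThreadsPerBlock;
  }

  // Phase 2: apply the op and store. Only the bounds check remains; there is
  // no separate N == 0 test here.
  id = start;
  for (int i = 0; i < kItemsPerThread; ++i) {
    if (id < n) output[id] = op(values[i]);
    id += kThreadsPerBlock;
  }
}

// Emulates the kernel launch by looping over every block and thread.
template <typename T, typename Func>
void LaunchElementwise(const std::vector<T>& input, std::vector<T>& output, Func op) {
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(input.size());
  output.resize(input.size());
  if (n == 0) return;  // Assumption: empty inputs are filtered out before launch.

  const std::ptrdiff_t items_per_block =
      static_cast<std::ptrdiff_t>(kItemsPerThread) * kThreadsPerBlock;
  const int blocks = static_cast<int>((n + items_per_block - 1) / items_per_block);

  for (int b = 0; b < blocks; ++b) {
    for (int t = 0; t < kThreadsPerBlock; ++t) {
      ElementwiseThreadBody(b, t, input.data(), output.data(), op, n);
    }
  }
}
```

With these parameters each block covers 1024 elements, so the grid needs a quarter as many blocks as a one-item-per-thread launch. Calling `LaunchElementwise(in, out, [](float x) { return 2.0f * x + 1.0f; })` on a 1030-element vector exercises one full block plus a partial tail.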