onnxruntime
6651d2f6 - Make elementwise op run 4 items per thread (#2335)

Commit

6 years ago

Make elementwise op run 4 items per thread (#2335)

Description:
- Make elementwise ops run 4 items per thread.
- Unroll the for loop to leverage instruction-level parallelism (ILP).
- Remove the unnecessary N==0 check inside the elementwise GPU kernel.

Motivation and Context:
Improves the performance of GPU elementwise ops, with a ~2% performance gain on a popular NLP BERT model.
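The indexing pattern behind "4 items per thread" can be sketched on the CPU. The following is a minimal emulation in plain C++, not the code from this commit: the names `kItemsPerThread`, `emulate_thread`, and `elementwise` are hypothetical, and the block size of 256 is an assumed value. Each emulated thread first loads up to four inputs spaced `threads_per_block` apart into a small local array, then applies the operation and stores the results. On a GPU, that stride lets neighbouring threads touch neighbouring addresses, and separating the loads from the computation gives the unrolled loop independent instructions to overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: number of elements each emulated thread handles.
constexpr int kItemsPerThread = 4;

// CPU emulation of the work done by a single GPU thread (illustrative only).
// block_idx / thread_idx stand in for blockIdx.x / threadIdx.x in a kernel.
template <typename T, typename Func>
void emulate_thread(const T* in, T* out, std::size_t n,
                    std::size_t block_idx, std::size_t thread_idx,
                    std::size_t threads_per_block, Func f) {
  // First element owned by this thread; the rest follow at a stride of
  // threads_per_block, so adjacent threads read adjacent addresses.
  const std::size_t start =
      block_idx * threads_per_block * kItemsPerThread + thread_idx;

  T regs[kItemsPerThread];  // stands in for per-thread registers

  // Phase 1: issue all loads. The fixed trip count lets a compiler unroll.
  std::size_t id = start;
  for (int i = 0; i < kItemsPerThread; ++i) {
    if (id < n) {
      regs[i] = in[id];
      id += threads_per_block;
    }
  }

  // Phase 2: compute and store, with the same bounds guard as phase 1.
  id = start;
  for (int i = 0; i < kItemsPerThread; ++i) {
    if (id < n) {
      out[id] = f(regs[i]);
      id += threads_per_block;
    }
  }
}

// Emulates the kernel launch by looping over every block and thread.
template <typename T, typename Func>
std::vector<T> elementwise(const std::vector<T>& in, Func f,
                           std::size_t threads_per_block = 256) {
  assert(threads_per_block > 0);
  const std::size_t n = in.size();
  std::vector<T> out(n);
  const std::size_t per_block = threads_per_block * kItemsPerThread;
  const std::size_t blocks = (n + per_block - 1) / per_block;  // ceil division
  for (std::size_t b = 0; b < blocks; ++b) {
    for (std::size_t t = 0; t < threads_per_block; ++t) {
      emulate_thread(in.data(), out.data(), n, b, t, threads_per_block, f);
    }
  }
  return out;
}
```

In this sketch an empty input yields zero blocks, so no emulated thread runs and the per-element `id < n` guard is the only bounds check that is needed. Calling `elementwise(v, [](int x) { return 2 * x + 1; }, 32)` on a 1000-element vector covers every index exactly once, including the partially filled last block.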

References

#2335 - Make elementwise op run 4 items per thread

Author

yufenglee

Parents

ba0e7daf
