onnxruntime
e1901a7e - Improve performance of CUDA implementations for GatherElements and Greater, Equal and Less (#4989)

Commit
5 years ago
Make GatherElements kernel process 16 items each.
Unroll the constant loop.
Quit loops early for zero dividend.
Optimize Binary CompareFunction and remove Impl_Cast invocation.
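
The GatherElements items in this message can be illustrated with a small host-side C++ sketch. This is not the code from the commit: the names `GatherInputOffset`, `GatherElements`, and `kItemsPerWorker` are hypothetical, and reading "16 items each" as "16 items per CUDA thread" is an assumption. The sketch shows a constant-trip-count inner loop (the kind a CUDA compiler can unroll) and an offset computation that stops dividing once the remaining dividend reaches zero, since every outer coordinate is then zero and adds nothing to the offset.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical constant: how many output items one worker handles.
// Assumed here to correspond to "16 items each" in the commit message.
constexpr int kItemsPerWorker = 16;

// Maps a linear position in the indices tensor to a linear offset in the
// input tensor. `gathered` is the (already non-negative) index read from
// the indices tensor; it replaces the coordinate along `axis`.
int64_t GatherInputOffset(int64_t linear, int rank, int axis,
                          const int64_t* indices_shape,
                          const int64_t* input_strides,
                          int64_t gathered) {
  // The axis contribution does not depend on the loop, so add it up front.
  int64_t offset = gathered * input_strides[axis];
  // Peel coordinates off from the innermost dimension outward.
  // Once `linear` (the dividend) is zero, all remaining coordinates are
  // zero, so the loop can quit early.
  for (int dim = rank - 1; dim >= 0 && linear != 0; --dim) {
    const int64_t coord = linear % indices_shape[dim];
    linear /= indices_shape[dim];
    if (dim != axis) offset += coord * input_strides[dim];
  }
  return offset;
}

// CPU stand-in for the kernel: each outer iteration plays the role of one
// worker, and the inner loop has a compile-time trip count.
std::vector<float> GatherElements(const std::vector<float>& input,
                                  const std::vector<int64_t>& input_shape,
                                  const std::vector<int64_t>& indices,
                                  const std::vector<int64_t>& indices_shape,
                                  int axis) {
  const int rank = static_cast<int>(input_shape.size());
  assert(rank >= 1 && rank == static_cast<int>(indices_shape.size()));
  assert(axis >= 0 && axis < rank);

  // Row-major strides of the input tensor.
  std::vector<int64_t> strides(rank, 1);
  for (int d = rank - 2; d >= 0; --d) {
    strides[d] = strides[d + 1] * input_shape[d + 1];
  }

  const int64_t n = static_cast<int64_t>(indices.size());
  std::vector<float> output(n);
  for (int64_t base = 0; base < n; base += kItemsPerWorker) {
    // Constant trip count; in a CUDA kernel this is where an unroll
    // directive would typically be applied.
    for (int k = 0; k < kItemsPerWorker; ++k) {
      const int64_t i = base + k;
      if (i >= n) break;
      int64_t idx = indices[i];
      if (idx < 0) idx += input_shape[axis];  // ONNX allows negative indices
      output[i] = input[GatherInputOffset(i, rank, axis, indices_shape.data(),
                                          strides.data(), idx)];
    }
  }
  return output;
}
```

For example, gathering along axis 0 from a 3x3 input holding 1 through 9 with indices `[[1, 2, 0], [2, 0, 0]]` selects `input[1][0]`, `input[2][1]`, `input[0][2]` for the first output row. The early exit helps most for positions near the start of the tensor, where the dividend reaches zero after only one or two divisions.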