Improve performance of CUDA implementations for GatherElements and Greater, Equal and Less (#4989)
Make each GatherElements kernel thread process 16 items.
Unroll the loop with a constant trip count. Exit loops early when the dividend is zero.
Optimize Binary CompareFunction and remove Impl_Cast invocation.
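
Illustration only, not the code from this change: a minimal host-side C++
sketch of the offset computation that the unrolling and the zero-dividend
early exit apply to. The names GatherOffset and kMaxRank, the fixed rank
bound, and the assumption that the gathered index is already non-negative
are all hypothetical. In a CUDA kernel the loop would typically carry
`#pragma unroll`, which a constant trip count makes possible.

```cpp
#include <cstdint>

// Hypothetical fixed upper bound on tensor rank. A constant trip count is
// what allows the compiler to unroll the loop below.
constexpr int kMaxRank = 4;

// Maps a flat position in the indices tensor to a flat offset in the input
// tensor. `gathered` is the (assumed non-negative) value read from the
// indices tensor at position `flat`; it replaces the coordinate along `axis`.
inline int64_t GatherOffset(int64_t flat, int64_t gathered, int rank, int axis,
                            const int64_t* indices_shape,
                            const int64_t* input_strides) {
  int64_t offset = gathered * input_strides[axis];
  int64_t rem = flat;  // the dividend that is repeatedly divided
  for (int d = kMaxRank - 1; d >= 0; --d) {
    if (d >= rank) continue;  // unused slots above the real rank
    if (rem == 0) break;      // all remaining coordinates are zero
    const int64_t coord = rem % indices_shape[d];
    rem /= indices_shape[d];
    if (d != axis) offset += coord * input_strides[d];
  }
  return offset;
}
```

Once the dividend reaches zero, every outer coordinate is zero and adds
nothing to the offset, so the remaining divisions and modulo operations can
be skipped.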