Improve performance of CUDA implementations for GatherElements and Greater, Equal and Less #4989
Start with a baseline by refactoring templates only.
df4370f7
Settle on thread_work_size = 8 so we do not lose too much parallelism.
3a532c85
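For readers unfamiliar with the thread_work_size idea, here is a minimal host-side sketch in C++ of how one logical thread might process several elements in a strided loop. It is an illustration only, not the kernel from this PR: the names `kThreadWorkSize`, `kThreadsPerBlock`, `LessThread` and `LessEmulated` are hypothetical, and the block size is kept tiny so the indexing is easy to trace.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical constants; the commit only states thread_work_size = 8.
constexpr int kThreadWorkSize = 8;   // elements handled by each logical thread
constexpr int kThreadsPerBlock = 4;  // assumption, kept small for tracing

// Emulates one logical GPU thread of an element-wise "less" comparison.
// Each thread visits kThreadWorkSize elements, strided by the block size,
// so that neighbouring threads touch neighbouring elements on every step.
void LessThread(int block_idx, int thread_idx,
                const std::vector<float>& a, const std::vector<float>& b,
                std::vector<bool>& out) {
  std::size_t id = static_cast<std::size_t>(block_idx) * kThreadsPerBlock *
                       kThreadWorkSize + thread_idx;
  for (int i = 0; i < kThreadWorkSize; ++i) {
    if (id < out.size()) {
      out[id] = a[id] < b[id];
    }
    id += kThreadsPerBlock;
  }
}

// Emulates the launch: enough blocks to cover every element once.
std::vector<bool> LessEmulated(const std::vector<float>& a,
                               const std::vector<float>& b) {
  std::vector<bool> out(a.size());
  const std::size_t per_block =
      static_cast<std::size_t>(kThreadsPerBlock) * kThreadWorkSize;
  const int blocks = static_cast<int>((a.size() + per_block - 1) / per_block);
  for (int blk = 0; blk < blocks; ++blk) {
    for (int t = 0; t < kThreadsPerBlock; ++t) {
      LessThread(blk, t, a, b, out);
    }
  }
  return out;
}
```

A larger work size means fewer threads are launched for the same input, which is presumably the parallelism trade-off the commit message refers to.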
Refactor Gather
dcb4855a
xx
2c666c9b
Add check for remain > 0 to skip a lot of divisions.
6b69deca
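One way such a check could save divisions, sketched on the host in C++: when a linear offset is decomposed into per-axis coordinates, the loop can stop as soon as the remainder reaches zero, because every remaining coordinate is then zero. The helper `Unravel` and its `divisions` counter are hypothetical and exist only for illustration; the actual loop in this PR may be structured differently.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: converts a linear offset into one coordinate per axis,
// walking from the innermost axis outwards. The `remain > 0` condition ends
// the loop early; untouched coordinates keep their initial value of zero.
// `divisions`, if provided, counts how many divide steps were executed.
std::vector<int64_t> Unravel(int64_t offset, const std::vector<int64_t>& dims,
                             int* divisions = nullptr) {
  std::vector<int64_t> coord(dims.size(), 0);
  int64_t remain = offset;
  for (int i = static_cast<int>(dims.size()) - 1; i >= 0 && remain > 0; --i) {
    coord[i] = remain % dims[i];
    remain /= dims[i];
    if (divisions != nullptr) {
      ++*divisions;
    }
  }
  return coord;
}
```

For small offsets in a high-rank tensor, most iterations are skipped, while the result is the same as running the full loop.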
Merge branch 'master' into yuslepukhin/hummingbird
1dfe5856
Fix initialization
120d7559
Optimize Binary CompareFunction and remove Impl_Cast invocation.
1bb6ec4d
snnn
approved these changes
on 2020-09-02
yuslepukhin
deleted the yuslepukhin/hummingbird branch 5 years ago
Assignees
No one assigned