onnxruntime
6c7da5e9 - Optimize CUDA Sum op kernel and refactor CUDA elementwise variadic input op kernels (#4418)

Commit
5 years ago
For the special case where all variadic inputs of a kernel have the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation, which was previously used for this special case as well, handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
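The difference between the two strategies can be sketched on the host in plain C++. This is only an illustrative CPU analogue, not the CUDA code from the commit: `Tensor`, `SumByRepeatedBinary`, `SumFused`, `Sum`, and the cap `kMaxFusedInputs` are hypothetical names, the value 8 is an assumption (the message only says "few enough"), and broadcasting is left out of the general path for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a flattened tensor; not an onnxruntime type.
using Tensor = std::vector<float>;

// Assumed cap on the number of inputs the fused path accepts.
// The real limit is not stated in the commit message.
constexpr std::size_t kMaxFusedInputs = 8;

// General strategy: fold the inputs pairwise, one "binary kernel" pass per
// additional input. Real code would also broadcast; here all inputs are
// assumed to have equal length.
Tensor SumByRepeatedBinary(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  Tensor out = inputs[0];
  for (std::size_t k = 1; k < inputs.size(); ++k) {
    assert(inputs[k].size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] += inputs[k][i];  // one full pass over the output per input
    }
  }
  return out;
}

// Special-case strategy: a single pass in which each output element reads
// all inputs at the same index. On a GPU, i would be the element index
// handled by one thread of a single kernel launch.
Tensor SumFused(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty() && inputs.size() <= kMaxFusedInputs);
  const std::size_t n = inputs[0].size();
  Tensor out(n);
  for (std::size_t i = 0; i < n; ++i) {
    float acc = inputs[0][i];
    for (std::size_t k = 1; k < inputs.size(); ++k) {
      assert(inputs[k].size() == n);
      acc += inputs[k][i];
    }
    out[i] = acc;
  }
  return out;
}

// Dispatch: use the fused path only when every input has the same size and
// the input count is within the assumed cap; otherwise fall back.
Tensor Sum(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  bool same_size = true;
  for (const Tensor& t : inputs) {
    same_size = same_size && (t.size() == inputs[0].size());
  }
  if (same_size && inputs.size() <= kMaxFusedInputs) {
    return SumFused(inputs);
  }
  return SumByRepeatedBinary(inputs);
}
```

Both functions add the inputs in the same order for each element, so they produce identical results. The fused version writes the output once instead of once per additional input, which is the kind of saving the special case is after.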