onnxruntime
68fe7226 - GatherGrad optimization (#5524)

GatherGrad optimization (#5524)

The existing implementation of the GatherGrad CUDA kernel does not parallelize well for certain inputs, which can lead to poor performance. The computation is essentially a set of summations: values are gathered from the input, and the sums are scattered to the output. Previously, each sum was computed by a single thread, so a summation over a large number of values could dominate the overall kernel execution time.

The updated version adds an alternate implementation that splits each sum into partial sums which are accumulated together later, allowing for more parallelism. A significant downside is that the alternate implementation requires CPU/GPU synchronization, because intermediate GPU results are needed by the CPU-side computation.

The original implementation still outperforms the alternate for certain inputs (e.g., where the maximum number of values in any sum is not large), so the updated version chooses between them based on an analysis of the input; this analysis adds some overhead.

The implementation was adapted from PyTorch (https://github.com/pytorch/pytorch/blob/b186831c08e0e4e447eedb8a5cfab582995d37f9/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
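As a rough illustration of the partial-sum strategy, the sketch below assumes the gather indices have already been sorted so that all gathered rows targeting the same output row form a contiguous segment, and that the host has split each segment into fixed-size chunks. The kernel and parameter names (ChunkedPartialSumKernel, AccumulatePartialSumsKernel, sorted_positions, chunk_begin, ...) are illustrative and not the actual onnxruntime symbols.

#include <cuda_runtime.h>

// Pass 1: each block handles one chunk of one segment and writes a partial
// row sum into a scratch buffer (one scratch row per chunk).
__global__ void ChunkedPartialSumKernel(
    const float* dY,              // [num_gathered, width] incoming gradients
    const int* sorted_positions,  // [num_gathered] dY row for each sorted slot
    const int* chunk_begin,       // [num_chunks] first sorted slot of the chunk
    const int* chunk_end,         // [num_chunks] one past the last slot
    float* partial,               // [num_chunks, width] scratch output
    int width) {
  const int chunk = blockIdx.x;
  for (int col = threadIdx.x; col < width; col += blockDim.x) {
    float sum = 0.0f;
    for (int slot = chunk_begin[chunk]; slot < chunk_end[chunk]; ++slot) {
      sum += dY[sorted_positions[slot] * width + col];
    }
    partial[chunk * width + col] = sum;
  }
}

// Pass 2: each block reduces the partial rows of one segment into dX.
// Rows of dX that receive no gradient are assumed to be zero-initialized.
__global__ void AccumulatePartialSumsKernel(
    const float* partial,           // [num_chunks, width] from pass 1
    const int* segment_chunk_begin, // [num_segments] first chunk of the segment
    const int* segment_chunk_end,   // [num_segments] one past the last chunk
    const int* segment_output_row,  // [num_segments] dX row the segment feeds
    float* dX,                      // [num_output_rows, width] gradient result
    int width) {
  const int seg = blockIdx.x;
  const int out_row = segment_output_row[seg];
  for (int col = threadIdx.x; col < width; col += blockDim.x) {
    float sum = 0.0f;
    for (int c = segment_chunk_begin[seg]; c < segment_chunk_end[seg]; ++c) {
      sum += partial[c * width + col];
    }
    dX[out_row * width + col] = sum;
  }
}

In a scheme like this, the number of chunks and segments (derived on the GPU from the indices) must be known on the CPU in order to size the scratch buffer and launch grids, which is the kind of intermediate result that forces the CPU/GPU synchronization mentioned above.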