torch.xlogy: Use wrapped_scalar_tensor / gpu_with_scalars to speed up GPU kernel. (#49926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49926
While investigating https://github.com/pytorch/pytorch/issues/49758, I changed the xlogy kernel to use the recommended wrapped_scalar_tensor pattern instead of moving the scalar to the GPU as a tensor.
While this doesn't avoid a synchronization (the move itself does not synchronize, since it's done via fill), it does significantly speed up the GPU kernel (by almost 50%; benchmark in the PR comments).
From the nvprof output, it appears this code path avoids broadcasting. Aside: this seems unnecessary, since from the point of view of broadcasting there is nothing special about whether the Tensor is ()-sized or marked as a wrapped scalar. Still, this is a useful change, as we avoid the extra kernel launches and dispatches needed to create and fill the tensor.
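For context on what the kernel computes (independent of this diff's dispatch change), here is a minimal pure-Python sketch of torch.xlogy's scalar semantics; the function name and structure are illustrative, not the actual ATen implementation:

```python
import math

def xlogy(x, y):
    # Reference semantics of xlogy on scalars: x * log(y), with two
    # special cases matching torch.xlogy's documented behavior:
    #   - the result is nan whenever y is nan,
    #   - otherwise the result is 0 when x == 0, even though log(y)
    #     alone would be -inf (y == 0) or undefined (y < 0).
    if math.isnan(y):
        return float("nan")
    if x == 0.0:
        return 0.0
    return x * math.log(y)
```

In the GPU kernel, the scalar operand is the piece this PR now passes via wrapped_scalar_tensor rather than materializing it as a filled GPU tensor.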
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D25724215
Pulled By: gchanan
fbshipit-source-id: 4adcd5d8b3297502672ffeafc77e8af80592f460