Migrate glu from the THC to ATen (CUDA) (#61153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61153
Fixes gh-24571, fixes gh-24572
Closes gh-39586
Benchmarks
----------
The benchmarks were run with nvprof, calling the operator in a loop. They show
reliable improvements for large tensors, while the TH implementation fares
better for smaller tensors. For sufficiently large tensors, the ATen
implementation does win out.
| Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
| 128, 1000 | 0 | 2.4770 | 2.0820 | 3.0440 | 3.4680 |
| | 1 | 2.7060 | 4.4850 | 3.3380 | 3.6250 |
| 128, 10000 | 0 | 26.531 | 21.366 | 38.083 | 34.623 |
| | 1 | 27.680 | 30.465 | 38.943 | 35.204 |
| 128, 100000 | 0 | 292.09 | 219.56 | 355.57 | 324.49 |
| | 1 | 260.43 | 243.08 | 332.25 | 323.37 |
| 128, 1000000 | 0 | 2475.7 | 1874.6 | 3810.1 | 3215.7 |
| | 1 | 2586.3 | 2380.9 | 3349.9 | 3207.8 |
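For reference, the glu semantics being migrated can be sketched in NumPy. This is a minimal illustration of the operator's math only, not the ATen CUDA kernel; the function name and NumPy-based formulation here are illustrative assumptions:

```python
import numpy as np

def glu(x, dim=-1):
    # Gated Linear Unit: split x into two equal halves a, b along `dim`,
    # then gate the first half with the sigmoid of the second:
    #   glu(x) = a * sigmoid(b)
    # Illustrative sketch of the semantics; not the CUDA implementation.
    a, b = np.split(x, 2, axis=dim)
    return a * (1.0 / (1.0 + np.exp(-b)))
```

Splitting along `dim` halves that dimension, which is why the benchmark shapes above pair an even size with each tested dim.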
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D29538093
Pulled By: ngimel
fbshipit-source-id: 1f66b45ec7c46fb8e680b50110a5fde6fe7faab7