MAINT Migrates rrelu_with_noise from THC to ATen on CUDA (#57864)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24618
Related to https://github.com/pytorch/pytorch/issues/24507
<details><summary>Benchmark script:</summary>
```py
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0  # backward pass is not timed in this script
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)  # unused; forward-only benchmark
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```
</details>
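For context, RReLU passes non-negative inputs through unchanged; for negative inputs it multiplies by a slope drawn uniformly from `[lower, upper]` in training mode, and by the fixed mean slope `(lower + upper) / 2` in eval mode. A minimal pure-Python sketch of that semantics (an illustration only, not the ATen kernel this PR adds; `rrelu` here is a hypothetical helper):

```python
import random

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=random):
    """Elementwise RReLU over a list of floats (reference sketch)."""
    out = []
    for v in x:
        if v >= 0:
            out.append(v)  # identity for non-negative inputs
        else:
            # training: random slope per element; eval: fixed mean slope
            a = rng.uniform(lower, upper) if training else (lower + upper) / 2
            out.append(a * v)
    return out
```

With the PyTorch defaults `lower=1/8, upper=1/3`, eval mode scales negatives by `11/48`, while training mode yields a random slope in `[1/8, 1/3]` per element.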
### Results from benchmark:
#### This PR
```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)
```
#### On master
```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)
```
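Using the numbers above, the largest case (128 x 100,000) improves from 0.66 ms to 0.54 ms, roughly an 18% reduction in forward time; a quick check:

```python
# Timings (ms) from the tables above for input size (128, 100000)
master_ms, pr_ms = 0.66, 0.54
speedup_pct = (master_ms - pr_ms) / master_ms * 100
print(f"{speedup_pct:.0f}% faster")
```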
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57864
Reviewed By: H-Huang
Differential Revision: D29177169
Pulled By: ngimel
fbshipit-source-id: 4572133db06f143d27e70a91ade977ea962c8f77