[ROCm] replace ROCmLoops.cuh with hipified CUDALoops.cuh (#120101)
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.
Performance was verified with the operator benchmarks in pytorch/benchmarks/operator_benchmark:
```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```
On MI200, this improved performance by 3% on average.
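This message does not say how the average was computed. As a rough sketch, one plausible reading is the arithmetic mean of the per-benchmark latency reductions between a baseline run and a run with this change. The function name, the benchmark names, and the latency values below are hypothetical and only illustrate the arithmetic; they are not measurements from this change.

```python
from statistics import mean


def mean_improvement_pct(before_us, after_us):
    """Average per-benchmark latency reduction, in percent.

    Both arguments map a benchmark name to a latency in microseconds.
    Only benchmarks present in both runs are compared.
    """
    shared = sorted(before_us.keys() & after_us.keys())
    if not shared:
        raise ValueError("no benchmarks in common")
    # Positive values mean the new run is faster than the baseline.
    return mean(
        100.0 * (before_us[name] - after_us[name]) / before_us[name]
        for name in shared
    )


# Illustrative numbers only (assumption), not results from MI200.
before = {"hypothetical_add": 100.0, "hypothetical_abs": 200.0}
after = {"hypothetical_add": 90.0, "hypothetical_abs": 190.0}
print(f"{mean_improvement_pct(before, after):.2f}%")
```

A geometric mean of speedup ratios would be an equally reasonable choice and is less sensitive to a few outlier benchmarks.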
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD