_min_max.dim: CUDA implementation (#42943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943
Adds a CUDA kernel for _min_max_val.dim
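For reviewers unfamiliar with the op, here is a minimal pure-Python sketch of the semantics assumed here: the min and the max along one dimension are produced together in a single pass over the data. The helper name `min_max_dim` and the 2-D nested-list input are hypothetical and for illustration only; this is not the CUDA kernel added by this PR.

```python
def min_max_dim(matrix, dim):
    """Return (mins, maxs) of a non-empty 2-D list reduced along `dim`.

    Hypothetical reference helper, not part of PyTorch.
    """
    if dim not in (0, 1):
        raise ValueError("dim must be 0 or 1 for a 2-D input")
    # Reducing along dim 0 walks columns; transposing lets both cases share one loop.
    lanes = list(zip(*matrix)) if dim == 0 else matrix
    mins, maxs = [], []
    for lane in lanes:
        lo = hi = lane[0]
        for value in lane[1:]:
            # Each element is read once and updates both running results.
            if value < lo:
                lo = value
            if value > hi:
                hi = value
        mins.append(lo)
        maxs.append(hi)
    return mins, maxs


data = [[1.0, -2.0, 3.0],
        [4.0, 0.5, -6.0]]
row_mins, row_maxs = min_max_dim(data, dim=1)
col_mins, col_maxs = min_max_dim(data, dim=0)
```

Computing both results in one traversal, rather than with two separate reductions, is the general idea behind a fused min/max op.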
Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```
performance: ~50% runtime savings on a tensor representative of quantization workloads; benchmark: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23086797
fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a