min_max kernel: add CUDA (#42868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868
Adds a CUDA kernel for the _min_max function.
Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805;
it was faster to resubmit than to resurrect that one. Thanks to durumu
for writing the original implementation!
Future PRs will add index support, docs, and hook this up to observers.
Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```
Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9
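For intuition only: one plausible reason for a saving of this size is that a
fused reduction reads the input once instead of twice (one pass for min, one
for max). The pure-Python sketch below illustrates that single-pass idea. It
is not the CUDA kernel from this PR, and the name fused_min_max is
hypothetical, chosen for illustration.
```python
from typing import Iterable, Tuple


def fused_min_max(values: Iterable[float]) -> Tuple[float, float]:
    """Return (min, max) of `values` using a single pass over the data.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    it = iter(values)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("fused_min_max() requires a non-empty input")

    lo = hi = first
    for v in it:
        # Each element is visited once and compared against both running
        # results, so the data is traversed one time rather than two.
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi


if __name__ == "__main__":
    data = [3.0, -1.5, 7.25, 0.0]
    # Same result as calling min() and max() separately, with one traversal.
    print(fused_min_max(data) == (min(data), max(data)))
```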
TODO
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23057766
fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891