_min_max_val.dim: CPU implementation (#42894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894
Continuing the min_max kernel implementation, this PR adds the
CPU path for when a `dim` is specified. The next PR will replicate this for CUDA.
Note: after a discussion with ngimel, we are taking the fast path
of computing only the values and not the indices, since the values are
all that quantization needs; computing the indices as well would require
support for reductions with 4 outputs, which is additional work. As a
result, the API doesn't fully match `min.dim` and `max.dim`.
I'm flexible on the name; let me know if something else is better.
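For context, a minimal sketch of the API difference (the Python binding name `torch._min_max_val` is an assumption based on the title; the op added here returns values only):
```
import torch

x = torch.randn(2, 3, 4)

# Existing ops follow the `min.dim` / `max.dim` API: (values, indices) tuples.
min_vals, min_idxs = torch.min(x, dim=1)
max_vals, max_idxs = torch.max(x, dim=1)

# The op in this PR computes both reductions in a single pass over `x` but
# returns only the values, which is all quantization observers need.
# Binding name assumed, not confirmed by this commit:
# min_vals, max_vals = torch._min_max_val(x, dim=1)
```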
Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```
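The values-only output can be checked against the existing reductions; a sketch of that comparison, using the same hypothetical binding as above:
```
import torch

def check_minmax_values(x, dim):
    # Reference results from the existing single-output reductions.
    ref_min = torch.min(x, dim=dim).values
    ref_max = torch.max(x, dim=dim).values
    # The fused op should reproduce these values exactly (binding assumed):
    # mins, maxs = torch._min_max_val(x, dim=dim)
    # assert torch.equal(mins, ref_min) and torch.equal(maxs, ref_max)

check_minmax_values(torch.randn(5, 7, 11), dim=2)
```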
performance: seeing a 49% speedup for a fused min+max on tensors with
shapes similar to the ones we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc.), I've seen speedups
as low as 20%, but we don't have a good use case to optimize for those,
so perhaps we can save that for a future PR.
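A rough outline of the kind of comparison the gist runs (the shape below is a placeholder, not one of the benchmarked shapes):
```
import torch
import torch.utils.benchmark as benchmark

# Placeholder shape; the observer-like shapes are in the linked gist.
x = torch.randn(1024, 1024)

# Baseline: two separate passes over the tensor.
t_separate = benchmark.Timer(
    stmt="torch.min(x, dim=0); torch.max(x, dim=0)",
    globals={"x": x, "torch": torch},
)
print(t_separate.timeit(100))

# Fused single-pass op from this PR (binding name assumed):
# t_fused = benchmark.Timer(
#     stmt="torch._min_max_val(x, dim=0)",
#     globals={"x": x, "torch": torch},
# )
# print(t_fused.timeit(100))
```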
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23086798
fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5