add min_and_max function and cpu kernel to speed up observers (#41570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570
For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.
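
As a rough illustration of the idea (not the kernel added in this PR), a
single-pass reduction with two outputs can be sketched in plain Python. The
helper name `min_and_max_single_pass` is hypothetical and chosen only for
this example; the real implementation is a vectorized CPU kernel.

```python
import math


def min_and_max_single_pass(values):
    """Return (min, max) of an iterable while reading each element once.

    Hypothetical helper for illustration only. An empty input returns
    (inf, -inf), and NaN values are ignored because comparisons with NaN
    are always False.
    """
    lo, hi = math.inf, -math.inf
    for v in values:
        # Both running results are updated from the same read of v,
        # instead of scanning the data twice (once for min, once for max).
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi


data = [0.5, -5.4485, 5.0390, 1.25]
result = min_and_max_single_pass(data)
print(result)
```

The saving comes from memory traffic: two separate reductions read the
tensor twice, while the combined reduction reads it once.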
One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.
This PR implements the easier CPU path to get an initial validation.
Some additional work is needed in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead
Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```
Quick benchmark (not representative of real-world use cases):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```
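
Note that the `% decrease` values in the log appear to be fractions rather
than percentages. A minimal check, assuming the value is computed as
`(separate - combined) / separate` (the variable names here are
illustrative and not taken from the benchmark script):

```python
# Timings copied from the OMP_NUM_THREADS=1 run in the log above.
separate = 11.90243935585022
combined = 6.353186368942261

# Assumed formula: relative reduction in runtime.
fraction_saved = (separate - combined) / separate
percent_saved = 100.0 * fraction_saved

print(f"{fraction_saved:.4f} -> {percent_saved:.1f}%")  # 0.4662 -> 46.6%
```

Under that assumption, the combined reduction cuts runtime by roughly 45%
to 47% across the three thread counts shown.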
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22589349
fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed