use float as accumulate type for reduce Ops: min, max, minmax on CPU (#96079)
Use float32 as the accumulation type for `min`, `max` and `minmax`: in `vec::reduce_all`, float16 inputs are now accumulated in float32.
The performance benefit comes primarily from the vectorization of `Half` added in https://github.com/pytorch/pytorch/pull/96076.
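The widening step is lossless for these ops: every binary16 value is exactly representable in float32, so comparing in the wider type cannot change the result of `min`/`max`. A minimal Python sketch of the idea (not the actual C++ kernel; the fp16 encode/decode helpers here are illustrative stand-ins for `at::Half` storage and conversion):

```python
import struct

def to_fp16_bits(x: float) -> int:
    # Encode to IEEE binary16 and return the raw 16-bit payload,
    # analogous to how a Half tensor stores uint16_t values.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def fp16_bits_to_float(bits: int) -> float:
    # Decode binary16 bits back to a wider float -- the "widen before
    # accumulating" step; binary16 -> float32 is exact.
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# A Half tensor holds raw 16-bit payloads.
data = [to_fp16_bits(v) for v in [1.0, 3.5, -2.0, 0.25]]

# Reduce max with a wider-precision accumulator, mirroring the patch:
# each lane is widened once, then all comparisons run in the wide type.
acc = float('-inf')
for bits in data:
    acc = max(acc, fp16_bits_to_float(bits))

assert acc == 3.5  # identical to reducing in fp16, but vectorizes better
```

On CPU the same structure lets the kernel convert a vector of `Half` lanes to two float32 vectors and use vectorized float32 `max`, instead of falling back to scalar fp16 comparisons.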
Tested on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz.
**single socket**
```
(before)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024]) 2.071 ms
(after)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024]) 0.071 ms
```
**single core**
```
(before)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024]) 33.488 ms
(after)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024]) 0.953 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96079
Approved by: https://github.com/jgong5, https://github.com/kit1980