Re-enable AVX512 ATen kernels for compute-intensive ops (#104165)
## Summary
Enables AVX512 dispatch by default for the kernels for which AVX512 performs better than AVX2. All other kernels continue to use their AVX2 counterparts, even on AVX512-capable machines.
## Implementation details
`REGISTER_DISPATCH` should now be used only for non-AVX512 dispatch.
`ALSO_REGISTER_AVX512_DISPATCH` should be used when a kernel should additionally be dispatched with AVX512.
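For illustration, here is a minimal sketch of how the two macros could appear at the bottom of an ATen CPU kernel file (such files are compiled once per CPU capability, with the kernels living in the `CPU_CAPABILITY` namespace). The stub and kernel names below are illustrative, not taken from this PR:

```cpp
// AVX512 measured slower than AVX2 for this op: register the kernel for
// every capability except AVX512, so an AVX512 machine falls back to the
// AVX2 kernel for it.
REGISTER_DISPATCH(some_op_stub, &CPU_CAPABILITY::some_op_kernel);

// AVX512 measured faster than AVX2 for this op (see the table below):
// register the kernel for AVX512 as well.
ALSO_REGISTER_AVX512_DISPATCH(another_op_stub, &CPU_CAPABILITY::another_op_kernel);
```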
## Benchmarking results with #104655
[Raw data at GitHub Gist (Click on `Download ZIP`)](https://gist.github.com/sanchitintel/87e07f84774fca8f6b767aeeb08bc0c9)
| Op | Speedup of AVX512 over AVX2 |
|----|------------------------------------|
|sigmoid|~27% with FP32|
|sign| ~16.6%|
|sgn|~15%|
|sqrt|~4%|
|cosh|~37%|
|sinh|~37.5%|
|acos| ~8% with FP32 |
|expm1| ~30% with FP32|
|log|~2%|
|log1p|~16%|
|erfinv|~6% with FP32|
|LogSigmoid|~33% with FP32|
|atan2|~40% with FP32|
|logaddexp| ~24% with FP32|
|logaddexp2| ~21% with FP32|
|hypot| ~24% with FP32|
|igamma|~4% with FP32|
|lgamma| ~40% with FP32|
|igammac|~3.5%|
|gelu|~3% with FP32|
|glu|~20% with FP32|
|SiLU|~35% with FP32|
|Softplus|~33% with FP32|
|Mish|~36% with FP32|
|Hardswish|~7% faster with FP32 when tensor can fit in L2 cache|
|Hardshrink|~8% faster with FP32 when tensor can fit in L2 cache|
|Softshrink|~10% faster with FP32 when tensor can fit in L2 cache|
|Hardtanh|~12.5% faster with FP32 when tensor can fit in L2 cache|
|Hardsigmoid|~7% faster with FP32 when tensor can fit in L2 cache|
|hypot|~35%|
|atan2|~37%|
|dequantize per channel|~10%|
## Insights gleaned from the collected data (future action items)
1. The in-place variants of some ops are faster with AVX512 even though their functional variants may be slower with FP32. We will enable AVX512 dispatch for the in-place variants of such kernels.
2. Almost all BF16 kernels are faster with AVX512, so after the PyTorch 2.1 release, we will enable AVX512 dispatch for BF16 kernels whose corresponding FP32 kernels do not perform well with AVX512.
3. Some kernels rely on auto-vectorization and might perform better with AVX512 once explicit vectorization is enabled for them.
Data was collected with 26 physical threads on one socket of an Intel Xeon 8371HC. Intel OpenMP and tcmalloc were preloaded.
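For anyone wanting to compare the two code paths for a single op, the dispatched capability can be selected at runtime via the `ATEN_CPU_CAPABILITY` environment variable (e.g. `avx2` vs. `avx512`). Below is a minimal libtorch timing sketch, not the harness used for the numbers above; the op, tensor size, and iteration counts are illustrative:

```cpp
// Build against libtorch, then run twice: once with ATEN_CPU_CAPABILITY=avx2
// and once with ATEN_CPU_CAPABILITY=avx512, and compare the per-iteration time.
#include <torch/torch.h>
#include <chrono>
#include <cstdio>

int main() {
  torch::NoGradGuard no_grad;
  torch::Tensor x = torch::randn({1 << 22});             // FP32 input
  for (int i = 0; i < 10; ++i) (void)torch::sigmoid(x);  // warm-up
  constexpr int kIters = 1000;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) (void)torch::sigmoid(x);
  auto t1 = std::chrono::steady_clock::now();
  std::printf("sigmoid: %.2f us/iter\n",
              std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);
  return 0;
}
```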
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104165
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/kit1980