[Quant] Vectorize scalar remainder in quantized kernel for normalization (#79673)
## Description
This PR improves performance of quantized kernel for normalize by vectorizing scalar remainder.
In the current implementation [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp), the computation is vectorized while the scalar remainder is handled in a `for` loop. The remainder is also vectorized to improve performance in this PR.
This kernel is for contiguous (NCHW) memory layout. For channels-last memory layout, a fast path is added in this PR https://github.com/pytorch/pytorch/pull/70520
The improvement is beneficial for layer norm, group norm and instance norm as this kernel is used for them.
## Changes
1. Add an argument `size` to `Vectorized<T>::loadu()` for vec256_qint and vec512_qint.
2. Load the remainder with the new `loadu` and do computation in the similar way as for vectorized part.
## Validation
### Test method:
Run quantized group norm with group = 2.
Op CPU time measured by `torch.profiler.profile` with warmup = 20, active = 200
### Common environment:
- Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- OS: CentOS Linux 7 (Core) (x86_64)
- Python version: 3.7.10
- Use JeMalloc memory allocator
- MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto
- Using Intel OpenMP
- KMP_AFFINITY=granularity=fine,compact,1,0
- KMP_BLOCKTIME=1
### Case 1: AVX2
**Environment**
- GCC version: (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
- AVX2 enabled, AVX512 disabled, i.e., vec256 used
**Run a single instance on a single core**
Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 2, 8, 5) | 3.73 | 3.75 | 4.51 | 99.41% | 82.75% | Remainder size = 8
(1, 2, 8, 6) | 3.76 | 4.00 | 4.53 | 93.93% | 82.95% | Remainder size = 16
(1, 2, 8, 7) | 3.74 | 4.01 | 4.52 | 93.34% | 82.84% | Remainder size = 24
(1, 2, 8, 8) | 3.90 | 3.96 | 4.49 | 98.49% | 87.00% | No remainder
(1, 2, 8, 17) | 4.00 | 4.17 | 4.72 | 95.83% | 84.69% | Remainder size = 8
(1, 2, 8, 18) | 4.00 | 4.23 | 4.72 | 94.54% | 84.89% | Remainder size = 16
(1, 2, 8, 19) | 4.03 | 4.29 | 4.76 | 94.01% | 84.70% | Remainder size = 24
(1, 2, 8, 20) | 3.92 | 3.93 | 4.76 | 99.67% | 82.29% | No remainder
(1, 2, 8, 33) | 4.10 | 4.18 | 5.06 | 97.92% | 81.00% | Remainder size = 8
(1, 2, 8, 34) | 4.07 | 4.23 | 5.06 | 96.40% | 80.53% | Remainder size = 16
(1, 2, 8, 35) | 4.11 | 4.42 | 5.09 | 93.03% | 80.72% | Remainder size = 24
(1, 2, 8, 36) | 4.03 | 4.06 | 5.11 | 99.24% | 78.83% | No remainder
![image](https://user-images.githubusercontent.com/12522207/173979129-e393e13f-71f5-4987-95ea-ac6e0c895bd7.png)
**Run a single instance on two cores**
Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 4, 8, 5) | 5.09 | 5.24 | 5.52 | 97.17% | 92.29% | Remainder size = 8
(1, 4, 8, 6) | 5.22 | 5.50 | 5.56 | 94.95% | 93.86% | Remainder size = 16
(1, 4, 8, 7) | 5.04 | 5.60 | 5.51 | 89.97% | 91.44% | Remainder size = 24
(1, 4, 8, 8) | 5.30 | 5.29 | 5.56 | 100.23% | 95.27% | No remainder
(1, 4, 8, 17) | 5.36 | 5.56 | 6.05 | 96.53% | 88.69% | Remainder size = 8
(1, 4, 8, 18) | 5.48 | 5.71 | 6.25 | 95.99% | 87.67% | Remainder size = 16
(1, 4, 8, 19) | 5.44 | 5.81 | 6.25 | 93.65% | 87.11% | Remainder size = 24
(1, 4, 8, 20) | 5.43 | 5.34 | 6.07 | 101.76% | 89.43% | No remainder
(1, 4, 8, 33) | 5.52 | 5.58 | 6.51 | 98.89% | 84.75% | Remainder size = 8
(1, 4, 8, 34) | 5.50 | 5.71 | 6.63 | 96.22% | 82.95% | Remainder size = 16
(1, 4, 8, 35) | 5.50 | 6.16 | 6.40 | 89.33% | 85.95% | Remainder size = 24
(1, 4, 8, 36) | 5.37 | 5.48 | 6.54 | 97.94% | 81.98% | No remainder
![image](https://user-images.githubusercontent.com/12522207/173981377-6222e278-0948-4f52-809b-28899399ca65.png)
### Case 2: AVX512
**Environment**
- GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
- AVX512 enabled, i.e., vec512 used
**Run a single instance on a single core**
Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 2, 16, 5) | 3.66 | 3.94 | 4.52 | 92.79% | 80.93% | Remainder size = 16
(1, 2, 16, 6) | 3.77 | 4.28 | 4.60 | 88.15% | 81.90% | Remainder size = 32
(1, 2, 16, 7) | 3.85 | 4.41 | 4.57 | 87.36% | 84.20% | Remainder size = 48
(1, 2, 16, 8) | 3.70 | 3.76 | 4.62 | 98.62% | 80.10% | No remainder
(1, 2, 16, 17) | 3.91 | 4.06 | 4.97 | 96.43% | 78.71% | Remainder size = 16
(1, 2, 16, 18) | 3.82 | 4.34 | 5.01 | 88.19% | 76.30% | Remainder size = 32
(1, 2, 16, 19) | 3.86 | 4.56 | 5.05 | 84.63% | 76.28% | Remainder size = 48
(1, 2, 16, 20) | 3.80 | 3.87 | 5.08 | 98.14% | 74.73% | No remainder
(1, 2, 16, 33) | 3.89 | 4.23 | 5.65 | 91.94% | 68.85% | Remainder size = 16
(1, 2, 16, 34) | 3.91 | 4.46 | 5.70 | 87.68% | 68.61% | Remainder size = 32
(1, 2, 16, 35) | 4.04 | 4.68 | 5.72 | 86.44% | 70.64% | Remainder size = 48
(1, 2, 16, 36) | 4.00 | 3.99 | 5.71 | 100.28% | 69.96% | No remainder
![image](https://user-images.githubusercontent.com/12522207/173982490-4687c5bc-50e8-49aa-9fe2-7967c738dbfb.png)
**Run a single instance on two cores**
Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 4, 16, 5) | 5.43 | 5.53 | 5.92 | 98.12% | 91.60% | Remainder size = 16
(1, 4, 16, 6) | 5.35 | 5.85 | 6.05 | 91.53% | 88.54% | Remainder size = 32
(1, 4, 16, 7) | 5.31 | 6.04 | 6.18 | 87.97% | 85.93% | Remainder size = 48
(1, 4, 16, 8) | 5.30 | 5.27 | 6.30 | 100.66% | 84.16% | No remainder
(1, 4, 16, 17) | 5.47 | 5.67 | 6.48 | 96.51% | 84.45% | Remainder size = 16
(1, 4, 16, 18) | 5.53 | 5.86 | 6.59 | 94.28% | 83.78% | Remainder size = 32
(1, 4, 16, 19) | 5.48 | 6.13 | 6.57 | 89.39% | 83.38% | Remainder size = 48
(1, 4, 16, 20) | 5.35 | 5.31 | 6.95 | 100.79% | 76.91% | No remainder
(1, 4, 16, 33) | 5.62 | 5.77 | 7.31 | 97.28% | 76.80% | Remainder size = 16
(1, 4, 16, 34) | 5.56 | 5.85 | 7.06 | 95.03% | 78.71% | Remainder size = 32
(1, 4, 16, 35) | 5.67 | 6.10 | 7.09 | 93.03% | 79.98% | Remainder size = 48
(1, 4, 16, 36) | 5.50 | 5.39 | 7.20 | 102.15% | 76.42% | No remainder
![image](https://user-images.githubusercontent.com/12522207/173982748-5f003630-18a4-4c3d-a643-b8711892cc39.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79673
Approved by: https://github.com/jerryzh168