Add scalar conversion using AVX instructions for Half (#102140)
### Motivation
Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float, because there is no direct data-type conversion instruction for a single Half value on CPU. This PR adds scalar conversion using AVX instructions for Half to speed it up.
### Testing
Tested max pooling and compared with the results of #98819.
Single socket (28 cores):
shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398
Single core:
shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch