Add Half support for BatchNorm on CPU (#102070)
Fixes #106543
### Testing
Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 | 0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki