Add Half (fp16) support for GroupNorm on CPU (#100234)
### Testing
Single socket (28 cores):
* Contiguous:
shape | forward fp32 / s | forward mixed fp32 fp16 / s | backward fp32 / s | backward mixed fp32 fp16 / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 2.45E-05 | 3.26E-05 | 6.87E-05 | 7.40E-05
[10, 128, 80, 80] | 0.000726 | 0.000606 | 0.002183 | 0.001112
* Channels Last:
shape | forward fp32 / s | forward mixed fp32 fp16 / s | backward fp32 / s | backward mixed fp32 fp16 / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 2.88E-05 | 2.72E-05 | 6.56E-05 | 6.63E-05
[10, 128, 80, 80] | 0.00076 | 0.000256 | 0.002385 | 0.000735
Single core:
* Contiguous:
shape | forward fp32 / s | forward mixed fp32 fp16 / s | backward fp32 / s | backward mixed fp32 fp16 / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 9.47E-05 | 1.90E-04 | 2.03E-04 | 3.10E-04
[10, 128, 80, 80] | 6.25E-03 | 8.98E-03 | 0.016485 | 0.01369
* Channels Last:
shape | forward fp32 / s | forward mixed fp32 fp16 / s | backward fp32 / s | backward mixed fp32 fp16 / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 8.66E-05 | 7.89E-05 | 1.95E-04 | 1.43E-04
[10, 128, 80, 80] | 5.97E-03 | 3.13E-03 | 0.01626 | 8.70E-03
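
To make the tables easier to compare, the timings can be turned into speedup ratios. The sketch below is not part of the PR's benchmark script (which is not shown here). It only copies the numbers from the tables above and divides the fp32 time by the mixed fp32/fp16 time, so a value above 1 means the mixed path was faster. The names `TIMINGS`, `speedups`, and `SPEEDUPS` are hypothetical helpers introduced for illustration.

```python
# Timings in seconds, copied verbatim from the tables above.
# Key:   (thread setup, memory format, input shape)
# Value: (forward fp32, forward mixed, backward fp32, backward mixed)
SMALL = (10, 128, 10, 10)
LARGE = (10, 128, 80, 80)

TIMINGS = {
    ("single socket", "contiguous", SMALL): (2.45e-05, 3.26e-05, 6.87e-05, 7.40e-05),
    ("single socket", "contiguous", LARGE): (0.000726, 0.000606, 0.002183, 0.001112),
    ("single socket", "channels last", SMALL): (2.88e-05, 2.72e-05, 6.56e-05, 6.63e-05),
    ("single socket", "channels last", LARGE): (0.00076, 0.000256, 0.002385, 0.000735),
    ("single core", "contiguous", SMALL): (9.47e-05, 1.90e-04, 2.03e-04, 3.10e-04),
    ("single core", "contiguous", LARGE): (6.25e-03, 8.98e-03, 0.016485, 0.01369),
    ("single core", "channels last", SMALL): (8.66e-05, 7.89e-05, 1.95e-04, 1.43e-04),
    ("single core", "channels last", LARGE): (5.97e-03, 3.13e-03, 0.01626, 8.70e-03),
}


def speedups(timings):
    """Return {key: (forward speedup, backward speedup)}.

    Speedup is defined here (an assumption of this sketch) as
    fp32 time / mixed fp32-fp16 time, so values > 1 favor the mixed path.
    """
    result = {}
    for key, (fwd_fp32, fwd_mixed, bwd_fp32, bwd_mixed) in timings.items():
        result[key] = (fwd_fp32 / fwd_mixed, bwd_fp32 / bwd_mixed)
    return result


SPEEDUPS = speedups(TIMINGS)

if __name__ == "__main__":
    for (setup, layout, shape), (fwd, bwd) in SPEEDUPS.items():
        print(f"{setup:13s} {layout:13s} {list(shape)}: "
              f"forward x{fwd:.2f}, backward x{bwd:.2f}")
```

Read this way, the largest gains are for the [10, 128, 80, 80] channels-last input (roughly 3x forward on a single socket), while the small contiguous input on a single core is about 2x slower in the forward pass with the mixed path.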
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100234
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki