Fix GroupNorm backward prop on CUDA (#92671)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/89485
Adds a test to prevent such regressions in the future.
In the process, discovered that GroupNormBackwards on CPU does not produce the same results when the input and gradient memory_format differ.
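
A minimal sketch of the kind of check described above, not the test added by this PR. The helper names `group_norm_grads` and `max_grad_mismatch`, the tensor shape, and the group count are hypothetical and chosen only for illustration. It runs GroupNorm backward once with contiguous tensors and once with the requested memory formats, then reports the largest gradient difference:

```python
import torch


def group_norm_grads(x, grad_out, num_groups):
    """Return (grad_input, grad_weight, grad_bias) for one group_norm backward pass."""
    # clone() preserves the memory format of dense tensors by default.
    x = x.detach().clone().requires_grad_(True)
    num_channels = x.shape[1]
    weight = torch.ones(num_channels, dtype=x.dtype, device=x.device, requires_grad=True)
    bias = torch.zeros(num_channels, dtype=x.dtype, device=x.device, requires_grad=True)
    out = torch.nn.functional.group_norm(x, num_groups, weight, bias)
    out.backward(grad_out)
    return x.grad, weight.grad, bias.grad


def max_grad_mismatch(x_format, grad_format, device="cpu"):
    """Largest absolute gradient difference versus an all-contiguous reference run."""
    torch.manual_seed(0)
    # Illustrative shape: batch 2, 6 channels in 3 groups, 4x4 spatial.
    x = torch.randn(2, 6, 4, 4, dtype=torch.float64, device=device)
    grad_out = torch.randn_like(x)
    reference = group_norm_grads(x, grad_out, num_groups=3)
    candidate = group_norm_grads(
        x.contiguous(memory_format=x_format),
        grad_out.contiguous(memory_format=grad_format),
        num_groups=3,
    )
    return max((r - c).abs().max().item() for r, c in zip(reference, candidate))


if __name__ == "__main__":
    formats = (torch.contiguous_format, torch.channels_last)
    for x_format in formats:
        for grad_format in formats:
            print(x_format, grad_format, max_grad_mismatch(x_format, grad_format))
```

Passing `device="cuda"` exercises the CUDA path on a machine with a GPU. A nonzero mismatch for a mixed-format combination on CPU would correspond to the discrepancy noted above; whether it reproduces depends on the PyTorch build.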
Fixes #92166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92671
Approved by: https://github.com/ngimel, https://github.com/xuzhao9