Improve BatchNorm1d performance (CUDA) (#57786)
Summary:
Part of gh-38915, resubmit of gh-57034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786
Reviewed By: mruberry
Differential Revision: D28290284
Pulled By: ngimel
fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da