Splitting #91254 into two PRs (#92748)
This PR handles the xnumel=1 case and introduces no performance
regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92748
Approved by: https://github.com/lezcano, https://github.com/jansel