[inductor] Optimize welford reduction (#120330)
This does two things:
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal
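
As a rough illustration of the two changes, here is a minimal pure-Python sketch of a streaming Welford update. It is not the generated Triton code: the names `welford_step` and `welford_variance` are hypothetical, and which divisions are rewritten as reciprocal multiplications in the actual kernels is an assumption made for illustration.

```python
def welford_step(value, mean, m2, weight, first_iteration):
    """One streaming Welford update (hypothetical helper, not the inductor API)."""
    if first_iteration:
        # Short circuit: the accumulator is known to be empty, so the
        # result is just the new value with zero spread and weight 1.
        return value, 0.0, 1.0
    new_weight = weight + 1.0
    # Assumption: multiply by a reciprocal instead of dividing.
    inv_weight = 1.0 / new_weight
    delta = value - mean
    new_mean = mean + delta * inv_weight
    new_m2 = m2 + delta * (value - new_mean)
    return new_mean, new_m2, new_weight


def welford_variance(values):
    """Return (mean, population variance) of a non-empty sequence."""
    if len(values) == 0:
        raise ValueError("values must be non-empty")
    mean = m2 = weight = 0.0
    for i, v in enumerate(values):
        mean, m2, weight = welford_step(
            float(v), mean, m2, weight, first_iteration=(i == 0)
        )
    # Final normalization, again written as a reciprocal multiply.
    inv_n = 1.0 / weight
    return mean, m2 * inv_n
```

When the reduction length is small, the first iteration is a large share of the total work, which is why skipping the accumulator arithmetic there matters most for small `rnumel`.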
Currently this is not enough to match the two-pass reduction with bfloat16, but it is still a significant improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano