[inductor] Skip Welford combine on first reduction loop iteration (#121488)
On the first iteration of the reduction loop we short-circuit `welford_reduce`,
since the accumulators are known to still hold their default values and the
combine step is therefore redundant.
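
For illustration only, here is a minimal scalar sketch of the idea in plain Python. It is not the inductor implementation: the function names, the scalar (non-tensor) form, and the assumption that the default accumulators are (mean, m2, weight) = (0, 0, 0) are all hypothetical. Under that assumption, combining a value with the defaults always yields (value, 0, 1), so the first iteration can return that directly.

```python
def welford_reduce_sketch(value, mean, m2, weight, first_iteration):
    """Hypothetical scalar stand-in for a Welford update step."""
    if first_iteration:
        # Accumulators are assumed to hold the defaults (0, 0, 0), so the
        # general update below would reduce to exactly this result.
        return value, 0.0, 1.0
    new_weight = weight + 1.0
    delta = value - mean
    new_mean = mean + delta / new_weight
    new_m2 = m2 + delta * (value - new_mean)
    return new_mean, new_m2, new_weight


def mean_and_variance(values):
    """Population mean and variance of a non-empty sequence."""
    mean, m2, weight = 0.0, 0.0, 0.0  # assumed default accumulator values
    for i, v in enumerate(values):
        mean, m2, weight = welford_reduce_sketch(v, mean, m2, weight, i == 0)
    return mean, m2 / weight
```

The short circuit does not change the result; it only avoids the subtraction, division, and multiply on an iteration where their outcome is already known.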
This is split out from #120330 to hopefully avoid the Meta-internal failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488
Approved by: https://github.com/lezcano