Properly initialize `grad_weight` in `raw_cudnn_convolution_backward_weight_out` (#72157)
Summary:
https://github.com/pytorch/pytorch/issues/71521 attempted to fix an issue where the `test_conv_large` test produced `NaN` values after the backward pass, making the comparison between the actual and expected results meaningless. While tweaking the initialization of the conv layer appeared to fix the behavior, it was only masking the real issue: `grad_weight` is not guaranteed to be initialized in `raw_cudnn_convolution_backward_weight_out` when the backward operation is split.
Specifically, `grad_weight` is expected to be written directly by a cuDNN kernel (which happens in the common, unsplit case), so it normally does not need to be initialized. When the operation is split, however, an intermediate `grad_weight_` tensor holds the per-split gradients, which are then accumulated into `grad_weight` without zeroing it first. This PR changes the split path so that accumulation starts from a zeroed tensor, and additionally performs the accumulation in an accumulation dtype. The hacky workaround that masked the issue is reverted, while the safeguard against comparing `NaN` values (using the reference tensor for the scale computation) is kept in place.
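As an illustration of the failure mode and of the fix described above, here is a minimal Python sketch; the names `partials`, `grad_weight_bad`, and `acc` are hypothetical stand-ins for the ATen internals, not the actual `raw_cudnn_convolution_backward_weight_out` code. Accumulating per-split gradients into a buffer created with `torch.empty` starts the sum from undefined memory, whereas zeroing the buffer first and accumulating in `float32` yields the intended result.

```python
import torch

# Per-split partial weight gradients, standing in for what each cuDNN call
# writes into the intermediate `grad_weight_` tensor.
partials = [torch.randn(64, 3, 7, 7).half() for _ in range(4)]

# Buggy pattern: `torch.empty` leaves the buffer uninitialized, so the
# accumulation starts from garbage (possibly NaN) instead of zero.
grad_weight_bad = torch.empty(64, 3, 7, 7, dtype=torch.half)
for p in partials:
    grad_weight_bad.add_(p)

# Fixed pattern: start from a zeroed buffer and accumulate in an
# accumulation dtype (float32 here), casting back to the weight dtype
# only at the end.
acc = torch.zeros(64, 3, 7, 7, dtype=torch.float32)
for p in partials:
    acc.add_(p.float())
grad_weight = acc.to(torch.half)
```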
CC ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72157
Reviewed By: malfet
Differential Revision: D34147547
Pulled By: ngimel
fbshipit-source-id: 056c19f727eeef96347db557528272e24eae4223
(cherry picked from commit 24c7f77a81c6ef5b0371ef0030e7003dcce55236)