remove unnecessary __syncthreads() in conv_depthwise2d_grad_weight_kernel (#84854)
Threads within a thread block are already synchronized inside BlockReduceSum once the intra-warp reduction finishes, so it is unnecessary to synchronize them again before invoking BlockReduceSum.
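
For illustration, here is a minimal, simplified sketch of this pattern. It is not the actual PyTorch source: `warp_reduce_sum`, `block_reduce_sum`, and `sum_kernel` are hypothetical names, and the sketch assumes a 1D block whose size is a multiple of 32 with per-thread partial sums held in registers. The point is that the reduction helper issues its own barrier right after the intra-warp step, so the caller needs no barrier before the call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Sum a value across the 32 lanes of a warp using register shuffles.
// After the loop, lane 0 holds the warp total.
__device__ float warp_reduce_sum(float val) {
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  return val;
}

// Hypothetical simplified block reduction (assumption: 1D block,
// blockDim.x is a multiple of 32 and at most 1024).
__device__ float block_reduce_sum(float val, float* shared) {
  const int lane = threadIdx.x % kWarpSize;
  const int warp = threadIdx.x / kWarpSize;

  val = warp_reduce_sum(val);  // intra-warp reduce, registers only
  __syncthreads();             // barrier inside the helper: guards `shared`
                               // against a previous use in the same block
  if (lane == 0) {
    shared[warp] = val;        // one partial sum per warp
  }
  __syncthreads();             // make all partial sums visible

  const int num_warps = blockDim.x / kWarpSize;
  val = (threadIdx.x < num_warps) ? shared[lane] : 0.0f;
  if (warp == 0) {
    val = warp_reduce_sum(val);  // thread 0 ends up with the block total
  }
  return val;
}

__global__ void sum_kernel(const float* in, float* out, int n) {
  __shared__ float shared[kWarpSize];

  // Each thread accumulates a private partial sum in a register.
  float acc = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    acc += in[i];
  }

  // No __syncthreads() here: `acc` is per-thread state, and the helper
  // synchronizes the block itself before touching shared memory.
  const float total = block_reduce_sum(acc, shared);
  if (threadIdx.x == 0) {
    *out = total;
  }
}

int main() {
  const int n = 1000;
  float* in = nullptr;
  float* out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  sum_kernel<<<1, 256>>>(in, out, n);
  cudaDeviceSynchronize();

  std::printf("sum = %f\n", *out);
  const bool ok = (*out == static_cast<float>(n));
  cudaFree(in);
  cudaFree(out);
  return ok ? 0 : 1;
}
```

A barrier before the call would only matter if the threads had written shared memory that the reduction then reads, which is not the case in this sketch.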
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84854
Approved by: https://github.com/ngimel