Use C10_WARP_SIZE to fix functionality on HIP vs CUDA for batch_norm_backward_reduce (#33098)
Summary:
1. Use `C10_WARP_SIZE` (32 on CUDA, 64 on HIP) instead of the hardcoded value `32`.
2. `getNumThreads` returns a minimum of 32 for CUDA, which matches the CUDA warp size. For HIP, however, it returns a minimum of 16, which is less than the HIP warp size of 64. This breaks the [reduce function](https://github.com/pytorch/pytorch/blob/14548c2d5b40d78f1b45376119eaecd297a83e6c/aten/src/ATen/native/cuda/Normalization.cuh#L115) where it zeroes out the unused entries in shared memory [here](https://github.com/pytorch/pytorch/blob/14548c2d5b40d78f1b45376119eaecd297a83e6c/aten/src/ATen/native/cuda/Normalization.cuh#L137). On CUDA, `blockDim.x` is always at least the warp size, so `shared[0]` is never zeroed out. On HIP, `blockDim.x` can be 16 or 32, so for small cases `blockDim.x * blockDim.y` can be less than the warp size (64), and the zeroing then clobbers `shared[0]` as well. As a result, the reduce function erroneously returns zero on ROCm, depending on how the block dimensions are set (see the sketch below).
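
For illustration only, here is a minimal host-side sketch of the zeroing guard described above, paraphrased from the linked `Normalization.cuh` (the helper name `zeroes_slot` and the example block shapes are made up for this demo, not taken from the kernel). It shows why `shared[0]` survives when the block holds at least one warp's worth of threads, but gets clobbered when it holds fewer:

```cpp
#include <cstdio>

// Returns true if thread `tid` would zero out shared[tid] under a guard of the form
//   tid >= blockDim.x * blockDim.y / warp_size && tid < warp_size
// (the shape of the guard at the second link above).
bool zeroes_slot(int tid, int block_x, int block_y, int warp_size) {
  return tid >= block_x * block_y / warp_size && tid < warp_size;
}

int main() {
  // CUDA: getNumThreads() returns at least 32 == warp size, so
  // block_x * block_y / 32 >= 1 and tid == 0 never zeroes shared[0].
  printf("CUDA block 32x1, warp 32 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 32, 1, 32));   // prints 0 (no)

  // HIP: getNumThreads() can return 16 < 64, so the integer division is 0,
  // tid == 0 passes the guard, and the partial sum in shared[0] is wiped.
  printf("HIP  block 16x1, warp 64 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 16, 1, 64));   // prints 1 (yes) -> reduce returns 0

  // HIP with at least a full warp of threads behaves like the CUDA case.
  printf("HIP  block 64x1, warp 64 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 64, 1, 64));   // prints 0 (no)
  return 0;
}
```

Once the warp size is taken from `C10_WARP_SIZE` rather than a hardcoded 32, the block/warp-size mismatch in the HIP case goes away and `shared[0]` keeps its partial sum.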
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33098
Differential Revision: D19837355
Pulled By: bddppq
fbshipit-source-id: ea526acd82ec08b1acb25be860b7e663c38ff173