Use C10_WARP_SIZE to fix functionality on HIP vs CUDA for batch_norm_backward_reduce (#33098)
Summary:
1. Use `C10_WARP_SIZE` (32 on CUDA, 64 on HIP) instead of the hardcoded value `32`.
2. `getNumThreads` returns a minimum of 32 for CUDA, which matches the CUDA warp size. For HIP, however, it returns a minimum of 16, which is less than the HIP warp size of 64. This breaks the [reduce function](https://github.com/pytorch/pytorch/blob/14548c2d5b40d78f1b45376119eaecd297a83e6c/aten/src/ATen/native/cuda/Normalization.cuh#L115) where it zeroes out the unused entries in shared memory [here](https://github.com/pytorch/pytorch/blob/14548c2d5b40d78f1b45376119eaecd297a83e6c/aten/src/ATen/native/cuda/Normalization.cuh#L137). On CUDA, `blockDim.x` is always at least the warp size, so `shared[0]` is never zeroed out. On HIP, `blockDim.x` can be 16 or 32, so for small cases `blockDim.x * blockDim.y` can be less than the warp size (64), and the zeroing then clobbers `shared[0]` as well. As a result, the reduce function erroneously returns zero on ROCm, depending on how the block dimensions are set (see the sketch below).
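
For illustration only, here is a minimal host-side sketch of the zeroing guard described above, paraphrased from the linked `Normalization.cuh` (the helper name `zeroes_slot` and the example block shapes are made up for this demo, not taken from the kernel). It shows why `shared[0]` survives when the block holds at least one warp's worth of threads, but gets clobbered when it holds fewer:

```cpp
#include <cstdio>

// Returns true if thread `tid` would zero out shared[tid] under a guard of the form
//   tid >= blockDim.x * blockDim.y / warp_size && tid < warp_size
// (the shape of the guard at the second link above).
bool zeroes_slot(int tid, int block_x, int block_y, int warp_size) {
  return tid >= block_x * block_y / warp_size && tid < warp_size;
}

int main() {
  // CUDA: getNumThreads() returns at least 32 == warp size, so
  // block_x * block_y / 32 >= 1 and tid == 0 never zeroes shared[0].
  printf("CUDA block 32x1, warp 32 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 32, 1, 32));   // prints 0 (no)

  // HIP: getNumThreads() can return 16 < 64, so the integer division is 0,
  // tid == 0 passes the guard, and the partial sum in shared[0] is wiped.
  printf("HIP  block 16x1, warp 64 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 16, 1, 64));   // prints 1 (yes) -> reduce returns 0

  // HIP with at least a full warp of threads behaves like the CUDA case.
  printf("HIP  block 64x1, warp 64 -> tid 0 zeroes shared[0]? %d\n",
         zeroes_slot(0, 64, 1, 64));   // prints 0 (no)
  return 0;
}
```

Once the warp size is taken from `C10_WARP_SIZE` rather than a hardcoded 32, the block/warp-size mismatch in the HIP case goes away and `shared[0]` keeps its partial sum.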
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33098
Differential Revision: D19837355
Pulled By: bddppq
fbshipit-source-id: ea526acd82ec08b1acb25be860b7e663c38ff173