Fix CUDA shared memory out of bound access in findPattern (#28989)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/28789
Only the first two elements of `smem` are used in this function, but at the beginning it resets all `C10_WARP_SIZE` elements to 0. When `scalar_t` is 64-bit, those writes run past the end of the total shared memory, which is only `sizeof(int) * C10_WARP_SIZE` bytes, although this did not cause any failure in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28989
Differential Revision: D18271598
Pulled By: ngimel
fbshipit-source-id: 38cc863722509892646f719efb05e2730a7d9ae1