Bounds checking for functor execution in vectorized/unrolled kernels (#33642)
Summary:
The current logic for vectorized/unrolled operations in CUDALoops.cuh applies bounds checking to loads and stores, [but not to the actual functor's execution](https://github.com/pytorch/pytorch/blob/16d6c17845426294274850f9161e292345f2afa5/aten/src/ATen/native/cuda/CUDALoops.cuh#L264). In other words, for a block acting on the tail of a tensor that doesn't require the whole block to participate in memory transactions, many threads execute their functor on uninitialized data. For functors that only communicate with the outside world via the bounds-checked loads and stores, that's ok. The threads acting on garbage data never actually write their results. But [my proposed inf/nan checking kernel](https://github.com/pytorch/pytorch/pull/33366/files#diff-9701a2b34900195d160bdc234e001b79R70-R79) has the additional side effect of writing to a `found_inf` flag in global memory. For irregularly-shaped tensors where tail threads execute the functor on garbage data, these threads would sometimes see and report spurious infs/nans.
In general, we can't guarantee functors won't have side effects. For safety (and efficiency), we should apply bounds checking to the functor execution as well as to the loads and stores.
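To make the failure mode concrete, here is a minimal host-side C++ sketch (not the actual PyTorch code; `run_block`, `check_value`, and the register-garbage stand-in are all hypothetical) that emulates one block acting on a tail of `n` valid elements. Loads are bounds-checked, but a side-effecting functor run unguarded on the out-of-bounds threads leaks garbage out through the `found_inf` flag:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

constexpr int kBlockSize = 8;

// Hypothetical functor with a side effect: reports any inf/nan it sees.
void check_value(float v, bool* found_inf) {
    if (std::isinf(v) || std::isnan(v)) *found_inf = true;
}

// Emulate one block processing a tail of n valid elements (n < kBlockSize).
// Threads past the tail skip the load, so their "register" v keeps garbage;
// we model that garbage as inf to show the worst case.
bool run_block(const float* data, int n, bool guard_functor) {
    bool found_inf = false;
    for (int tid = 0; tid < kBlockSize; ++tid) {  // stand-in for the threads
        float v = std::numeric_limits<float>::infinity();  // uninitialized register
        if (tid < n) v = data[tid];               // bounds-checked load
        if (!guard_functor || tid < n)
            check_value(v, &found_inf);           // functor execution
        // the store would also be bounds-checked; omitted here
    }
    return found_inf;
}
```

With five finite inputs and `n = 5`, the unguarded variant spuriously reports an inf from the three tail threads, while the guarded variant does not; the proposed fix corresponds to always passing the equivalent of `guard_functor = true`.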
Is it possible that other elementwise kernels (in addition to the vectorized/unrolled implementations) are also executing functors unconditionally? That would cause similar failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33642
Differential Revision: D20062985
Pulled By: ngimel
fbshipit-source-id: 65b8d75a001ce57865ed1c0cf89105d33f3f4dd4