Fix grid_sample out of boundary when grid contains large numbers (#35506)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/35202, fixes the GPU part of https://github.com/pytorch/pytorch/issues/24823, and is related to https://github.com/pytorch/pytorch/issues/24870.
Here is the origin of this problem.
1. When the grid contains large numbers, such as `grid.min() == -10059144` and `grid.max() == 67680944` in https://github.com/pytorch/pytorch/issues/35202, or `nan`, `inf`, and `1.0E20` in https://github.com/pytorch/pytorch/issues/24823, the unnormalization at
https://github.com/pytorch/pytorch/blob/4d39aeec271fde5a89aa68c7588023205c5ca8a9/aten/src/ATen/native/cuda/GridSampler.cu#L309-L321
produces `ix, iy` values that are very large and exceed INT_MAX.
The `ix_nw, iy_nw` variables are then cast to INT_MAX, and the variables derived from them with "+1" become INT_MIN.
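To make the magnitudes concrete, here is a host-side sketch of this step. `unnormalize` and `saturating_cast` are hypothetical names, the formula is an assumed, simplified form of the coordinate mapping, and the saturating helper only emulates the clamp-to-INT_MAX behavior described above (in standard C++ an out-of-range float-to-int cast is undefined, so it is not performed directly here).

```cpp
#include <climits>
#include <cmath>

// Hypothetical, simplified unnormalization: maps a grid coordinate in [-1, 1]
// to a pixel position in [0, size - 1]. Assumed form, for illustration only.
inline float unnormalize(float coord, int size) {
  return ((coord + 1.0f) / 2.0f) * static_cast<float>(size - 1);
}

// Emulates a saturating float-to-int conversion on the host.
inline int saturating_cast(float x) {
  if (std::isnan(x)) return 0;              // assumption: NaN maps to 0
  if (x >= 2147483648.0f) return INT_MAX;   // 2^31 and above clamp to INT_MAX
  if (x <= -2147483648.0f) return INT_MIN;  // -2^31 and below clamp to INT_MIN
  return static_cast<int>(x);               // in range: truncate toward zero
}
```

With `grid.max() == 67680944` and a width of 100, `unnormalize` returns roughly 3.35e9, which is above INT_MAX, so the emulated cast saturates.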
2. However, these INT_MAX and INT_MIN values should not be a big problem, because
https://github.com/pytorch/pytorch/blob/4d39aeec271fde5a89aa68c7588023205c5ca8a9/aten/src/ATen/native/cuda/GridSampler.cu#L358-L362
https://github.com/pytorch/pytorch/blob/4d39aeec271fde5a89aa68c7588023205c5ca8a9/aten/src/ATen/native/cuda/GridSampler.cuh#L202-L205
these `within_bounds_2d` functions are supposed to guard the if-statements, prevent illegal memory access, and leave those output values as zero (`padding_mode='zeros'`).
3. Now here comes the problem: `within_bounds_2d` is declared inline. We found that the `+1` and `>= 0` statements may lead the compiler to "optimize" the code. That is,
```cpp
int B = something;
int a = something;
int b = a + 1;
bool r = (b >= 0 && b < B);
```
may be compiled into assembly equivalent to
```cpp
int B = something;
int a = something;
bool r1 = (a > -2);
int b = a + 1;
bool r2 = (b < B);
bool r = r1 && r2;
```
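This looks nice, but when a = INT_MAX, `a+1` is signed overflow, which is undefined behavior. Typically we get b = INT_MIN, so `r1` is true (INT_MAX > -2) and `r2` is true (INT_MIN < B), and the compiled check returns true even though the original expression would have returned false for b = INT_MIN. `within_bounds_2d` then no longer guards us from the illegal memory access.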
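The difference between the two forms can be reproduced on the host without invoking undefined behavior by emulating the wraparound explicitly. In this sketch `wrapping_add_one`, `check_as_written`, and `check_as_transformed` are hypothetical names, and the INT_MAX to INT_MIN wraparound is an assumption about what the hardware typically does, not something the language guarantees.

```cpp
#include <climits>

// Emulates the typical hardware result of a + 1 without signed overflow:
// INT_MAX wraps to INT_MIN, every other value is incremented normally.
inline int wrapping_add_one(int a) {
  return a == INT_MAX ? INT_MIN : a + 1;
}

// The bounds check as written in the source: is b = a + 1 inside [0, B)?
inline bool check_as_written(int a, int B) {
  int b = wrapping_add_one(a);
  return b >= 0 && b < B;
}

// The check after the transformation: (b >= 0) has been rewritten as
// (a > -2), which is only equivalent if a + 1 never overflows.
inline bool check_as_transformed(int a, int B) {
  bool r1 = (a > -2);
  int b = wrapping_add_one(a);
  bool r2 = (b < B);
  return r1 && r2;
}
```

For ordinary inputs the two functions agree. For `a == INT_MAX` the first rejects the index and the second accepts it, which is the failure mode described above.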
4. There are different ways to fix this bug. For example, we could change all of the "ix_nw, iy_nw" variables to `int64_t`. That could hurt performance, and it would not handle the examples in https://github.com/pytorch/pytorch/issues/24823 with 1E20 in the grid.
One minimal fix that I found is to prevent `within_bounds_2d` from being inlined. That way, the compiler won't optimize the `a+1` and `a>=0` code together.
I did a short performance test, just to make sure this forced noinline solution won't cause a regression. The performance script can be found at
https://github.com/xwang233/code-snippet/blob/a6f8bce52222cd1c5270e22a87a4699b65741686/grid-sample/grid-sample.ipynb.
I have tested the `__attribute__((noinline))` attribute with nvcc, and there was no problem. I'm not sure whether it also works with clang.
cc csarofeen ptrblck ngimel bnehoran zasdfgbnm SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35506
Differential Revision: D20799304
Pulled By: ngimel
fbshipit-source-id: fc70289b35039fad954908a990ab0a2f16fbfcb2