Optimize grid sample 3d
Fixes #71415
I have replicated, for the 3D case, the changes that @to-mi made for the 2D case in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443):
> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
> * Changed the CPU kernels:
> (1) added `bool input_requires_grad` template parameter to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
> * Changed CUDA kernel:
> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
> * Have not touched the CPU fallback kernel.

Note: the changes numbered (3) above do not apply to the 3D case.
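
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the idea behind changes (1) and (2): the backward pass receives an `input_requires_grad` flag, and both the allocation and the accumulation of the `input` gradient are skipped when it is false. This is a hypothetical 1D linear-interpolation analogue, not the PyTorch kernel; the names `SampleGrads` and `linear_sample_backward` are made up for illustration, and the sketch assumes every grid coordinate lies in `[0, input.size() - 1)`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the two gradients (illustration only).
struct SampleGrads {
  std::vector<float> grad_input;  // left empty when the gradient is not requested
  std::vector<float> grad_grid;
};

// Backward pass of a 1D linear-interpolation sampler whose forward is
//   out[i] = (1 - t) * input[i0] + t * input[i0 + 1],
// with x = grid[i], i0 = floor(x), t = x - i0.
// Assumption: every grid[i] lies in [0, input.size() - 1).
SampleGrads linear_sample_backward(const std::vector<float>& input,
                                   const std::vector<float>& grid,
                                   const std::vector<float>& grad_output,
                                   const bool input_requires_grad) {
  assert(grid.size() == grad_output.size());

  SampleGrads grads;
  grads.grad_grid.assign(grid.size(), 0.0f);
  if (input_requires_grad) {
    // Allocate the (potentially large) buffer only when it is needed.
    grads.grad_input.assign(input.size(), 0.0f);
  }

  for (std::size_t i = 0; i < grid.size(); ++i) {
    const float x = grid[i];
    assert(x >= 0.0f);
    const std::size_t i0 = static_cast<std::size_t>(std::floor(x));
    assert(i0 + 1 < input.size());
    const float t = x - static_cast<float>(i0);

    // The gradient w.r.t. the grid coordinate is always computed.
    grads.grad_grid[i] = grad_output[i] * (input[i0 + 1] - input[i0]);

    // The gradient w.r.t. the input is accumulated only on request.
    if (input_requires_grad) {
      grads.grad_input[i0] += grad_output[i] * (1.0f - t);
      grads.grad_input[i0 + 1] += grad_output[i] * t;
    }
  }
  return grads;
}
```

Calling `linear_sample_backward(input, grid, grad_output, false)` returns the same `grad_grid` as the `true` call while leaving `grad_input` empty, which is the saving this PR is after when only the grid needs a gradient.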
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71759