[pytorch] activation checkpointing: enable mixing tensor without requires_grad (#45934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45934
https://pytorch.org/docs/stable/checkpoint.html PyTorch checkpointing requires every input to the checkpointed function to require grad, but this assumption does not always hold. Consider the following two examples:
```
output = MultiheadedMaskedAtten(input, mask)
output = LSTM(input, seq_length)
```
Both seq_length and mask are tensors that will never require grad. Currently, if you try to checkpoint such a function, torch.autograd.backward complains:
```
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/function.py
", line 87, in apply
return self._forward_cls.backward(self, *args)
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/utils/checkpoint.py"
, line 99, in backward
torch.autograd.backward(outputs, args)
File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/__init__.py
", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
```
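For reference, a minimal sketch of the failure mode (module and tensor names here are illustrative, not the original model code): the checkpointed function takes, and passes through, a tensor that never requires grad.
```
import torch
from torch.utils.checkpoint import checkpoint

def masked_attn_like(x, mask):
    # stand-in for an attention-style module: the second return value is a
    # tensor that never requires grad (the mask is passed through unchanged)
    scores = (x * mask).sum(dim=-1)
    return scores, mask

x = torch.randn(2, 4, requires_grad=True)
mask = torch.ones(2, 4)  # requires_grad=False on purpose

scores, _ = checkpoint(masked_attn_like, x, mask)
scores.sum().backward()  # used to raise: "element 1 of tensors does not require grad ..."
print(x.grad.shape)
```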
This diff allows skipping the non-grad-requiring tensors when running autograd.backward in the checkpoint backward pass.
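Roughly, after recomputing the forward pass, only the outputs that require grad (and their corresponding incoming gradients) are handed to torch.autograd.backward; inputs that never required grad simply get None back. A simplified sketch of the idea (not the literal code in this diff):
```
# inside CheckpointFunction.backward, after recomputing `outputs`
# from the detached inputs (simplified sketch)
outputs_with_grad = []
grad_outputs_with_grad = []
for out, grad_out in zip(outputs, grad_outputs):
    if out.requires_grad:
        outputs_with_grad.append(out)
        grad_outputs_with_grad.append(grad_out)

torch.autograd.backward(outputs_with_grad, grad_outputs_with_grad)

# tensors that never required grad get None as their gradient
grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None
              for inp in detached_inputs)
```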
Documentation for this behavior has been added as well.
Test Plan: Added a unit test to make sure checkpoint() works when only some of the input tensors require grad.
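A sketch of the kind of test that exercises this path (names are illustrative, not the exact test added to the suite):
```
import torch
from torch.utils.checkpoint import checkpoint

def test_checkpoint_partial_grad():
    def run_fn(tensor1, tensor2):
        # tensor2 never requires grad and is returned untouched
        return tensor1 * 2, tensor2

    inp = torch.rand(2, 2, requires_grad=True)
    mask = torch.rand(2, 2, requires_grad=False)
    out, _ = checkpoint(run_fn, inp, mask)
    out.sum().backward()
    assert inp.grad is not None
    assert mask.grad is None
```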
Differential Revision: D24094764
fbshipit-source-id: 6557e8e74132d5a392526adc7b57b6998609ed12