Improve checkpoint docs to warn users about detached gradient issues (#37266)
Summary:
See https://discuss.pytorch.org/t/training-with-gradient-checkpoints-torch-utils-checkpoint-appears-to-reduce-performance-of-model/78102/3?u=jwl for details.
Updated the docs to warn users about issues with checkpointing models that use `detach()` or `torch.no_grad()` to freeze layers/weights during training. When a model is frozen this way, training with `checkpoint` fails: `checkpoint` forces its outputs to require gradients, but the recomputed forward pass produces outputs that do not, so the backward pass raises:
```
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
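For context, a minimal sketch of the failure mode (the `frozen_segment` function is hypothetical, and this assumes the reentrant checkpoint implementation that is the default in the current codebase):

```
import torch
from torch.utils.checkpoint import checkpoint

def frozen_segment(x):
    # Freezing inside the checkpointed segment (detach() behaves the same):
    # when backward recomputes this function, the output has no grad_fn.
    with torch.no_grad():
        return x * 2

x = torch.randn(2, 10, requires_grad=True)
out = checkpoint(frozen_segment, x)  # output is forced to require grad
out.sum().backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

The same error appears whenever everything inside the checkpointed segment is detached from the graph, since the recomputed outputs then have no grad_fn to backpropagate through.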
It may be possible to fix this directly in the code, but I am not sure how to do so in the current codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37266
Differential Revision: D21262558
Pulled By: mrshenli
fbshipit-source-id: 529cf370534504baf8937ef17dac5d6916fbf5ae