Delay reduction of unused parameters until first autograd hook is called (#22219)
Summary:
Reduction of gradients for unused parameters should happen as soon as
possible, because those reductions can otherwise block reduction of
gradients for used parameters. This used to happen immediately when
`prepare_for_backward` was called and found parameters that didn't
contribute. This meant that if you had a model with unused parameters
and wanted to discard the model output (i.e. not call backward on some
loss), reduction of the gradients of those unused parameters would
already have been kicked off, and you'd see an error the next time you
called `forward`.
This commit changes that approach to delay reduction of the gradients
of those unused parameters until the first autograd hook is called.
This means you can now discard the model output regardless of whether
the model has unused parameters.
This is a prerequisite for making the `find_unused_parameters`
argument to DDP default to `True`.
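For illustration only (not part of this change), a minimal sketch of the
scenario this enables, assuming a single-process gloo group and a
hypothetical `ModelWithUnusedParam` module: the output of the first forward
pass is discarded without calling backward, and the next forward pass now
succeeds.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ModelWithUnusedParam(nn.Module):
    """Hypothetical module with a parameter that never participates in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)  # never used in forward()

    def forward(self, x):
        return self.used(x)


def main():
    # Single-process process group, for illustration only.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(ModelWithUnusedParam(), find_unused_parameters=True)

    # First forward pass: discard the output without calling backward.
    model(torch.randn(2, 10))

    # Second forward pass: previously this raised an error, because reduction
    # of the unused parameter's gradient had already been kicked off at the end
    # of the first forward pass; with reduction delayed until the first autograd
    # hook fires, discarding the output is fine.
    out = model(torch.randn(2, 10))
    out.sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```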
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983