Improve DDPOptimizer by avoiding small preamble graph (#93162)
This optimizes an edge case where some compute-only ops (e.g. add)
could end up in an orphan graph at the input side due to the bucket
for the next graph being full already. The fix is to fuse this
graph (which is "empty" in parameter count) together with the adjoining
"full" bucket.
Note: i encountered this when trying to repro some suspected duplicate
argument errors, but this is unrelated and I have not yet repro'd
a duplicate arg issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93162
Approved by: https://github.com/davidberard98