Corrected comments in fsdp (#80456)
Currently, the comments on the pre- and post-division steps in `FullyShardedDataParallel._post_backward_hook` state the following:
> Average grad by world_size for consistency with PyTorch DDP.
This does not match what actually happens: the pre-divide factor may or may not be equal to `world_size`.
For example, for `world_size = 3`, `predivide_factor = 2`.
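
The mismatch can be illustrated with a minimal sketch. It assumes the pre-divide factor is the power of two picked by a loop like the one below, and that the post-divide factor is `world_size / predivide_factor`. The function names `predivide_factor` and `split_division` are hypothetical and are not the FSDP API.

```python
def predivide_factor(world_size: int) -> float:
    # Assumption: grow a power-of-two factor while it divides world_size
    # and stays below the remaining quotient.
    factor = 1
    while world_size % factor == 0 and world_size / factor > factor:
        factor *= 2
    return float(factor)


def split_division(world_size: int) -> tuple[float, float]:
    # Gradients are divided by `pre` before the reduction and by `post`
    # after it, so the total division is pre * post == world_size.
    pre = predivide_factor(world_size)
    post = world_size / pre
    return pre, post


if __name__ == "__main__":
    print(split_division(3))  # (2.0, 1.5)
```

Under these assumptions neither step divides by `world_size` on its own; only the product of the two factors equals `world_size`.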
This PR clarifies the pre- and post-division comments in the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80456
Approved by: https://github.com/rohan-varma