Stop tracking backward chain of broadcast in initialization (#5075)
The DeepSpeed engine generates the following warning upon initialization.
The warning is triggered by a broadcast that synchronizes model
parameters across ranks. Although this behavior is harmless in terms of
both accuracy and, most likely, performance, the warning may confuse
users and could cause compatibility issues with future versions of PyTorch.
This PR runs the broadcast within a `torch.no_grad` context to prevent
tracking of the backward computation chain.
```
/home/aiscuser/.conda/envs/wbcast/lib/python3.9/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1704987277512/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
```
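For illustration, a minimal sketch of the technique: wrapping the parameter broadcast in a `torch.no_grad()` context so autograd does not record the collective. The helper name `broadcast_parameters`, the `src_rank` argument, and the standalone structure here are illustrative assumptions, not the actual DeepSpeed code; only `torch.distributed.broadcast` and `torch.no_grad` are real PyTorch APIs.

```python
import torch
import torch.distributed as dist


def broadcast_parameters(module: torch.nn.Module, src_rank: int = 0) -> None:
    """Synchronize module parameters from src_rank to all other ranks.

    Hypothetical sketch: the broadcast is executed under torch.no_grad()
    so that the c10d collective is not tracked in the autograd graph,
    which avoids the UserWarning shown above. Assumes the default
    process group has already been initialized via dist.init_process_group.
    """
    with torch.no_grad():
        for param in module.parameters():
            # In-place broadcast of the parameter data; no backward
            # chain is recorded because grad tracking is disabled.
            dist.broadcast(param, src=src_rank)
```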
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>