[PT-D][BE][TP perf 1/N] Get rid of unnecessary collectives in Embedding/EmbeddingBag and use autograd-enabled collectives (#81853)
The Embedding and EmbeddingBag ops for ShardedTensor, especially with row-wise sharding, are very inefficient and hard to fit into the future design. This PR therefore aims to:
1. Remove all unnecessary collective communications: only one gather and one reduce (or reduce-scatter) is needed.
2. Use autograd-enabled collectives so that these ops can be used in real model training.
3. Do some minor code cleanup.
4. Treat the input differently when it is a replicated tensor. (More handling for this will be added in the next few PRs.)
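The communication pattern in point 1 can be illustrated with a minimal, single-process sketch. This is not the actual ShardedTensor implementation: the rank loops below stand in for the real `all_gather` and `reduce_scatter` collectives, and all names (`sharded_embedding`, `shards`, etc.) are illustrative. The idea is that each rank owns a contiguous row shard of the embedding table; the inputs are gathered once so every rank can look up the rows it owns (contributing zeros otherwise), and a single reduce-scatter sums the partial lookups and returns each rank its own slice of the output.

```python
WORLD_SIZE = 2
DIM = 3
NUM_EMBEDDINGS = 4  # rows 0-1 on rank 0, rows 2-3 on rank 1

# Row-wise shards of the embedding table, one list of rows per rank.
shards = [
    [[float(r * DIM + c) for c in range(DIM)] for r in range(0, 2)],  # rank 0
    [[float(r * DIM + c) for c in range(DIM)] for r in range(2, 4)],  # rank 1
]

# Per-rank input indices into the global embedding table.
inputs = [[0, 3], [2, 1]]

def sharded_embedding(shards, inputs):
    rows_per_rank = NUM_EMBEDDINGS // WORLD_SIZE
    # Step 1: "all_gather" the per-rank inputs so every rank sees all indices.
    gathered = [idx for rank_inputs in inputs for idx in rank_inputs]
    # Step 2: every rank looks up the gathered indices against its own shard,
    # contributing a zero row for any index it does not own.
    partials = []
    for src_rank in range(WORLD_SIZE):
        lo = src_rank * rows_per_rank
        partials.append([
            shards[src_rank][idx - lo] if lo <= idx < lo + rows_per_rank
            else [0.0] * DIM
            for idx in gathered
        ])
    # Step 3: "reduce_scatter" -- sum the partial lookups elementwise, then
    # hand each rank back only the rows for its own original inputs.
    reduced = [
        [sum(p[i][c] for p in partials) for c in range(DIM)]
        for i in range(len(gathered))
    ]
    per_rank = len(inputs[0])
    return [reduced[r * per_rank:(r + 1) * per_rank] for r in range(WORLD_SIZE)]

out = sharded_embedding(shards, inputs)
# out[0] holds rank 0's lookups for indices [0, 3];
# out[1] holds rank 1's lookups for indices [2, 1].
```

In the real op these two steps map onto autograd-enabled collectives (point 2), so gradients flow back through the gather and reduce-scatter to the local shard during training.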
Differential Revision: [D37965687](https://our.internmc.facebook.com/intern/diff/D37965687/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81853
Approved by: https://github.com/wanchaol