[PT-D][Sharding] Enable more ops needed for transformer model training
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77214
While working with the MetaSeq model code base, we found that many ops are not supported by sharded tensor. https://github.com/pytorch/pytorch/pull/75374 already enabled most of them, and this PR/diff enables the rest.
Also fix some unit test errors.
Differential Revision: [D36302780](https://our.internmc.facebook.com/intern/diff/D36302780/)
Approved by: https://github.com/wanchaol, https://github.com/pritamdamania87