[dtensor] make replicate -> partial do division instead (#110898)
This PR switches the replicate -> partial conversion to divide by the
number of ranks instead of zeroing out the tensor on all but one rank.
This preserves the same numerics while avoiding per-rank behavior
differences, which also makes it friendlier to torch.compile.
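A minimal standalone sketch of why the numerics match (no process group;
the per-rank shards are simulated as a list, and the reduce is a plain
sum — names like `world_size` here are illustrative, not DTensor API):

```python
import torch

world_size = 4
replicated = torch.tensor([2.0, 6.0, -4.0])  # same value on every rank

# Old approach: keep the value on one rank, zero it on the others.
old_shards = [replicated.clone() if r == 0 else torch.zeros_like(replicated)
              for r in range(world_size)]

# New approach (this PR): every rank divides by world_size, so all
# ranks execute identical code.
new_shards = [replicated / world_size for _ in range(world_size)]

# Sum-reducing the partial shards recovers the original tensor either way.
assert torch.equal(torch.stack(old_shards).sum(0), replicated)
assert torch.equal(torch.stack(new_shards).sum(0), replicated)
```

The division variant runs the same op on every rank, which is what
removes the rank-dependent control flow that torch.compile would
otherwise have to specialize on.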
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110898
Approved by: https://github.com/fduwjj