[dtensor] refactor sharding cost model to account for latency (#119897)
This PR refactors the sharding cost model to estimate redistribute
cost more accurately, including both collective latency and
communication time.
The previous cost model did not rescale the latency and communication
time, so the latency factor was too small to count. As a result, for
small tensors, multiple collectives were preferred over a single
collective, which is wrong.
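To illustrate why the latency term matters for small tensors, here is a
minimal sketch of a latency-plus-bandwidth cost estimate. It is not the
code in this PR: the constants `LATENCY_US` and `BANDWIDTH_GB_PER_S` and
the helpers `collective_cost_us` and `redistribute_cost_us` are
hypothetical names and values chosen only for the example. Both terms
are expressed in microseconds so that they are comparable.

```python
from typing import Sequence

# Hypothetical constants for illustration only (not values from PyTorch).
LATENCY_US = 10.0            # assumed fixed startup cost per collective, in microseconds
BANDWIDTH_GB_PER_S = 100.0   # assumed link bandwidth, in GB/s


def collective_cost_us(num_bytes: int) -> float:
    """Estimate one collective as fixed latency plus transfer time (microseconds)."""
    transfer_us = num_bytes / (BANDWIDTH_GB_PER_S * 1e9) * 1e6
    return LATENCY_US + transfer_us


def redistribute_cost_us(bytes_per_collective: Sequence[int]) -> float:
    """Sum the estimated cost of every collective in a redistribute plan."""
    return sum(collective_cost_us(b) for b in bytes_per_collective)


# A small tensor (4 KiB): compare one collective with two half-sized ones.
small_tensor_bytes = 4096
one_collective = redistribute_cost_us([small_tensor_bytes])
two_collectives = redistribute_cost_us([small_tensor_bytes // 2] * 2)

# With latency on the same scale as transfer time, the single collective wins.
print(one_collective < two_collectives)
```

Under these assumed numbers the transfer time for 4 KiB is far below the
per-collective latency, so each extra collective adds almost a full
latency unit to the estimate.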
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l