switch dtensor and functional collective to use optree (#110670)
optree recently landed and provides quite good performance, so conditionally
import optree when it is installed.
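
For illustration only, the conditional-import pattern can be sketched as below. This is a minimal sketch, not the code in this PR: the `tree_map` wrapper, the `HAVE_OPTREE` flag, and the pure-Python fallback are hypothetical names introduced for the example, and the fallback only handles list, tuple, and dict containers.

```python
# Sketch of a conditional optree import (hypothetical helper names).
try:
    import optree  # C++-accelerated pytree library, optional dependency

    HAVE_OPTREE = True

    def tree_map(fn, tree):
        # Delegate to optree when it is available.
        return optree.tree_map(fn, tree)

except ImportError:
    HAVE_OPTREE = False

    def tree_map(fn, tree):
        # Assumed fallback: a small pure-Python traversal that only
        # understands list, tuple, and dict; everything else is a leaf.
        if isinstance(tree, (list, tuple)):
            return type(tree)(tree_map(fn, x) for x in tree)
        if isinstance(tree, dict):
            return {k: tree_map(fn, v) for k, v in tree.items()}
        return fn(tree)


# Usage: callers see the same function whichever backend was imported.
doubled = tree_map(lambda x: x * 2, {"a": [1, 2], "b": (3,)})
```

Callers are unaffected by which branch was taken, so the faster path is picked up automatically wherever optree is installed.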
Some numbers from testing an MLP layer with TP + functional collectives:
before this PR: 10.390ms
after this PR: 9.189ms
So the e2e CPU overhead drops by about 1.2ms, a reduction of a little over 10%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin