[MPS] Fix mps to cpu casting from a smaller dtype to a bigger dtype (#84928)
Fixes #82566, #80800
- Fix mps->cpu casts from a smaller dtype to a bigger dtype.
- Fix mps->mps casts from a smaller/bigger dtype to another dtype in the scatter case.
- For mps->cpu copies where we don't have a source/destination offset, we can save the cast result directly in the destTensor, so we can skip the additional overhead of the blit.
- When we can return the data without doing the blit, we need to check whether the call is blocking, in which case we need a synchronize(SyncType::COMMIT_AND_WAIT); call (previously this was done by the blit).
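
The copy-path decision described in the last two points can be sketched as a toy model in plain Python. This is not the actual MPS backend code: the function name plan_mps_to_cpu_copy, its parameters, and the step strings are hypothetical, and "no source/destination offset" is assumed to mean both offsets are zero.

```python
from typing import List


def plan_mps_to_cpu_copy(
    src_offset: int,
    dst_offset: int,
    needs_cast: bool,
    non_blocking: bool,
) -> List[str]:
    """Return the ordered steps a hypothetical mps->cpu copy would take.

    Illustrative model only; names and step labels are assumptions,
    not identifiers from the PyTorch source.
    """
    steps: List[str] = []

    # With no offsets on either side, the cast can write straight into
    # the destination tensor, so the blit is skipped.
    direct = needs_cast and src_offset == 0 and dst_offset == 0

    if direct:
        steps.append("cast into destination")
        # The blit used to handle synchronization. Without it, a blocking
        # call has to commit and wait explicitly.
        if not non_blocking:
            steps.append("synchronize(COMMIT_AND_WAIT)")
    else:
        if needs_cast:
            steps.append("cast into temporary")
        # In this model the blit is assumed to synchronize on its own,
        # as the description says it did before this change.
        steps.append("blit into destination")

    return steps


if __name__ == "__main__":
    print(plan_mps_to_cpu_copy(0, 0, needs_cast=True, non_blocking=False))
    print(plan_mps_to_cpu_copy(4, 0, needs_cast=True, non_blocking=False))
```

The model makes the new obligation explicit: only the direct path needs the extra synchronize step, and only when the caller asked for a blocking copy.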
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84928
Approved by: https://github.com/razarmehr