Remove cudaMemcpy on full memory overlap (#34548)
Summary:
TensorIterator already checks for partial overlap, so there is no trivial UB. However, TensorIterator does allow full overlap, where source and destination refer to the same memory, and it is reasonable to skip the memcpy in that case.
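
The idea can be sketched as follows. This is a minimal illustration, not the code from this change: `copy_unless_same` is a hypothetical helper, and host-side `std::memcpy` stands in for the CUDA copy so the snippet runs without a GPU. It assumes that "full overlap" means both pointers are identical and cover the same number of bytes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of PyTorch): copies `nbytes` from `src` to
// `dst`, but skips the copy when both pointers refer to the same memory.
// Returns true if a copy was actually issued, false if it was skipped.
bool copy_unless_same(void* dst, const void* src, std::size_t nbytes) {
  // Full overlap: the copy would be a no-op, so avoid issuing it at all.
  if (dst == src || nbytes == 0) {
    return false;
  }
  // Assumption: the caller has already rejected partial overlap, as
  // TensorIterator does, so the two ranges are disjoint here.
  std::memcpy(dst, src, nbytes);
  return true;
}
```

With two distinct buffers the helper copies and returns true; when the same buffer is passed as both source and destination, it returns false and leaves the data untouched.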
Fixes: https://github.com/pytorch/pytorch/issues/34525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34548
Differential Revision: D20371643
Pulled By: ngimel
fbshipit-source-id: ff9e2e872537010afe040204e008b2499af963ad