Replace cpu_apply with TensorIterator inside of Copy function (#18618)
Summary:
Replace cpu_apply functions with the TensorIterator.
Vectorize copy and clone functions.
Move large portions of the code into the CPU kernels folder so they can be compiled with AVX2.
Add a fast path for the copy_ function when the tensor types match.
A slowdown is observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the larger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single-threaded; multithreading gives an even bigger performance boost).
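For illustration only, the idea behind a same-type fast path can be sketched in standalone C++. This is not the code from this change and does not use TensorIterator; `copy_contiguous` is a hypothetical helper name, and the sketch assumes contiguous, non-overlapping buffers of trivially copyable elements. When source and destination element types match, the copy reduces to a single bulk `memcpy`; otherwise each element is converted individually.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not a PyTorch API): fast path used when the source
// and destination element types are identical. Assumes T is trivially
// copyable and that the two buffers do not overlap.
template <typename T>
void copy_contiguous(T* dst, const T* src, std::size_t n) {
  assert((dst != nullptr && src != nullptr) || n == 0);
  if (n > 0) {
    std::memcpy(dst, src, n * sizeof(T));  // one bulk copy, no conversion
  }
}

// Hypothetical helper (not a PyTorch API): general path used when the
// element types differ, converting one element at a time.
template <typename Dst, typename Src>
void copy_contiguous(Dst* dst, const Src* src, std::size_t n) {
  assert((dst != nullptr && src != nullptr) || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<Dst>(src[i]);
  }
}
```

Overload resolution picks the single-type version whenever both pointers refer to the same element type, so callers do not need to select the path themselves.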
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18618
Differential Revision: D14954118
Pulled By: VitalyFedyunin
fbshipit-source-id: 9d9bdf3fd9d5e539a03071cced50d0a47bac1615