optimize cat performance on CPU with TensorIterator (#30806)
Summary:
This PR aims at improving `cat` performance on CPU.
The current `cat` logic in the `TH` module is not parallelized when the input tensors are all contiguous.
This code also tries to reuse the same `TensorIterator` as much as possible to reduce the overhead of creating `TensorIterator` instances, which helps when each copied slice is small.
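
For illustration only, here is a minimal pure-Python sketch of how concatenating contiguous inputs breaks down into independent slice copies that can be split across workers. It is not the ATen implementation and does not use `TensorIterator`. The name `cat_contiguous` and its parameters are hypothetical, and Python threads are used only to show the work partitioning, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def cat_contiguous(inputs, shapes, dim, num_workers=4):
    """Concatenate flat row-major buffers along `dim` (hypothetical helper).

    inputs: list of flat Python lists, one per tensor.
    shapes: list of shape tuples; all must match except at `dim`.
    Returns (flat_output, output_shape).
    """
    first = shapes[0]
    # outer = product of sizes before `dim`, inner = product of sizes after it.
    outer = 1
    for s in first[:dim]:
        outer *= s
    inner = 1
    for s in first[dim + 1:]:
        inner *= s

    # For each outer index, every input contributes one contiguous slice.
    slice_lens = [shape[dim] * inner for shape in shapes]
    row_len = sum(slice_lens)
    out = [None] * (outer * row_len)

    def copy_rows(start, end):
        # Each worker writes a disjoint range of output rows.
        for o in range(start, end):
            dst = o * row_len
            for buf, n in zip(inputs, slice_lens):
                src = o * n
                out[dst:dst + n] = buf[src:src + n]  # same-length slice copy
                dst += n

    # Split the outer range into roughly equal chunks, one per task.
    chunk = max(1, (outer + num_workers - 1) // num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(copy_rows, s, min(s + chunk, outer))
            for s in range(0, outer, chunk)
        ]
        for f in futures:
            f.result()  # re-raise any worker exception

    out_shape = list(first)
    out_shape[dim] = sum(shape[dim] for shape in shapes)
    return out, tuple(out_shape)
```

When `dim` is the outermost dimension, `outer` is 1 and each input is copied as one large slice. For inner dimensions the slices become small and numerous, which is the case where per-copy setup overhead matters.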
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30806
Differential Revision: D19275026
Pulled By: VitalyFedyunin
fbshipit-source-id: 756e9b86891f725c256b0a6981887ff06d88b053