Optimize CPU Transpose for a single axis moving (#2461)
* Optimize CPU Transpose for the case where a single axis moves either inwards or outwards. We have optimizations for NCHW <-> NHWC in CUDA but not on CPU. This adds a more generic optimization to the CPU implementation.
Tested performance in both directions with element sizes of 8, 16, 32 and 64 bits; moved-axis sizes of 3, 16 and 32; and 100x100, 300x300 and 1000x1000 elements to move.
Across all tests, the average improvement was 2.5x, even with the overhead of Python. No cases were slower, and some were 6x faster.
Binary size increase in RelWithDebInfo build is ~5K.
NOTE: See PR comments for details of performance comparison with Eigen. Eigen is slightly faster but increases binary size by 55K just for support of rank 4 input. Binary size would be further increased to support different ranks.
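
For illustration only, the single-axis case described above can be sketched in Python. This is not the code from this change. The function names `single_axis_move` and `transpose_single_axis` are hypothetical, and the sketch assumes the ONNX/NumPy convention that output axis `i` is input axis `perm[i]`. The idea shown is that when only one axis moves, the tensor can be viewed as `[outer, rows, cols, inner]` and the transpose reduces to swapping the two middle dimensions while copying contiguous blocks of `inner` elements.

```python
from math import prod


def single_axis_move(perm):
    """Return (src, dst) if perm moves exactly one axis, else None.

    Assumes output axis i is taken from input axis perm[i].
    """
    n = len(perm)
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            axes = list(range(n))
            axes.insert(dst, axes.pop(src))  # move axis src to position dst
            if axes == list(perm):
                return src, dst
    return None


def transpose_single_axis(data, shape, perm):
    """Transpose a flat row-major list when perm moves a single axis."""
    move = single_axis_move(perm)
    if move is None:
        raise ValueError("perm does not move exactly one axis")
    src, dst = move
    lo, hi = min(src, dst), max(src, dst)
    outer = prod(shape[:lo])       # dimensions before the affected range
    inner = prod(shape[hi + 1:])   # contiguous block copied as a unit
    if src < dst:
        # Axis moves inwards: input middle is [axis, skipped dims].
        rows, cols = shape[src], prod(shape[src + 1:dst + 1])
    else:
        # Axis moves outwards: input middle is [skipped dims, axis].
        rows, cols = prod(shape[dst:src]), shape[src]
    out = [None] * len(data)
    for o in range(outer):
        base = o * rows * cols * inner
        for r in range(rows):
            for c in range(cols):
                s = base + (r * cols + c) * inner  # input block (r, c)
                d = base + (c * rows + r) * inner  # output block (c, r)
                out[d:d + inner] = data[s:s + inner]
    return out
```

Under these assumptions, NCHW -> NHWC corresponds to `perm = (0, 2, 3, 1)` (axis 1 moves inwards to position 3), and NHWC -> NCHW corresponds to `perm = (0, 3, 1, 2)` (axis 3 moves outwards to position 1).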