Optimize Concat and Split on CUDA to eliminate host-to-device copies when sizes are all the same (#8833)
* special case concat and split when sizes are equal
* add tests for 16 and 32 inputs with same dim
* add tests for 16/64 inputs on concat or 16/64 outputs on split
* try to eliminate Windows warning
* outter => outer
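
Illustrative sketch of the index arithmetic behind the equal-size special case. It is not the kernel from this change, and the names `SourceLocation` and `LocateEqualSizedConcat` are hypothetical. It assumes the general path needs a per-input size table on the device, and shows that when every input has the same extent along the concat axis, the source input and offset follow from division and modulo alone, so no such table is needed. It is written as host-side C++ so it can be checked without a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: where one output element comes from.
struct SourceLocation {
  std::size_t input_index;   // which input tensor holds the element
  std::size_t input_offset;  // flat offset of the element inside that input
};

// The output is viewed as [outer, num_inputs * dim_per_input, inner] and each
// input as [outer, dim_per_input, inner]. Because all inputs share
// dim_per_input, no lookup table of per-input sizes is required.
inline SourceLocation LocateEqualSizedConcat(std::size_t output_offset,
                                             std::size_t num_inputs,
                                             std::size_t dim_per_input,
                                             std::size_t inner) {
  assert(num_inputs > 0 && dim_per_input > 0 && inner > 0);
  const std::size_t concat_dim = num_inputs * dim_per_input;

  // Decompose the flat output offset into (outer, axis, inner) coordinates.
  const std::size_t inner_idx = output_offset % inner;
  const std::size_t axis_idx = (output_offset / inner) % concat_dim;
  const std::size_t outer_idx = output_offset / (inner * concat_dim);

  // Equal sizes: the input index is a plain division along the concat axis.
  const std::size_t input_index = axis_idx / dim_per_input;
  const std::size_t local_axis = axis_idx % dim_per_input;

  // Recompose the flat offset inside the selected input.
  const std::size_t input_offset =
      (outer_idx * dim_per_input + local_axis) * inner + inner_idx;
  return {input_index, input_offset};
}
```

Split with equal output sizes is the mirror image: the same decomposition of the input offset selects the output index and the offset within it.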