Implement Concat with Strided copy (#8336)
Adds a StridedCopy function that copies data from one strided tensor to another.
Using it parallelizes the Concat operator, and it can also be used in the future to parallelize other data movement operators (e.g. Transpose, Split).
This operation is also required for the proposed data layout extensions to ORT.
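
For illustration only, here is a minimal single-threaded sketch of a strided copy and of how Concat can be expressed with it. The names `StridedCopySketch` and `ConcatAxis1Example`, the signature, and the element-based strides are assumptions made for this example; they are not the ORT implementation or its API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the ORT signature): copies a `shape`-shaped block
// from src to dst, where each side has its own per-dimension strides,
// measured in elements.
template <typename T>
void StridedCopySketch(T* dst, const std::vector<std::int64_t>& dst_strides,
                       const T* src, const std::vector<std::int64_t>& src_strides,
                       const std::vector<std::int64_t>& shape) {
  assert(dst_strides.size() == shape.size());
  assert(src_strides.size() == shape.size());

  // Total number of elements to copy.
  std::int64_t total = 1;
  for (std::int64_t d : shape) total *= d;

  std::vector<std::int64_t> index(shape.size(), 0);
  for (std::int64_t n = 0; n < total; ++n) {
    // Translate the multi-dimensional index into an offset on each side.
    std::int64_t dst_off = 0;
    std::int64_t src_off = 0;
    for (std::size_t i = 0; i < shape.size(); ++i) {
      dst_off += index[i] * dst_strides[i];
      src_off += index[i] * src_strides[i];
    }
    dst[dst_off] = src[src_off];

    // Advance the index, innermost dimension fastest.
    for (std::size_t i = shape.size(); i-- > 0;) {
      if (++index[i] < shape[i]) break;
      index[i] = 0;
    }
  }
}

// Concatenates two 2x2 row-major matrices along axis 1 into a 2x4 output.
// Each input is copied into the output using the output's strides, starting
// at that input's column offset.
std::vector<int> ConcatAxis1Example() {
  const std::vector<int> a = {1, 2, 3, 4};  // 2x2
  const std::vector<int> b = {5, 6, 7, 8};  // 2x2
  std::vector<int> out(8, 0);               // 2x4

  const std::vector<std::int64_t> shape = {2, 2};
  const std::vector<std::int64_t> in_strides = {2, 1};
  const std::vector<std::int64_t> out_strides = {4, 1};

  StridedCopySketch(out.data(), out_strides, a.data(), in_strides, shape);
  StridedCopySketch(out.data() + 2, out_strides, b.data(), in_strides, shape);
  return out;  // {1, 2, 5, 6, 3, 4, 7, 8}
}
```

In this sketch each input writes to a disjoint region of the output, so the individual copies (and the iterations within one copy) are independent of each other, which is what makes this formulation a candidate for splitting across threads.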