Enable parallel output reordering in MlasReorderOutputNchw() (#13643)
### Description
This PR speeds up the output reordering operation (as implemented in
[MlasReorderOutputNchw](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/core/mlas/lib/reorder.cpp#L400))
by replacing the sequential implementation with a parallelized one. The
parallelization is achieved through the use of the existing
[TryBatchParallelFor](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/include/onnxruntime/core/platform/threadpool.h#L284)
construct.
### Motivation and Context
The output reordering operation is frequently executed in image-processing
models. Its implementation is easily parallelized, and therefore sped up
when executed on a multi-core machine. The speedup achieved by this PR
varies with the actual input.
The table below summarizes the results of some of the experiments I have
conducted on a 16-core VM running on an AMD EPYC 7742 64-core processor.
The experiment is based on the existing [unit
test](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/mlas/unittest/test_reorder_output.cpp)
for the output reordering operation. The first column gives the shape of
the output as BatchCount:Channels:Height:Width; the remaining columns give
the average latency (in µs, over 100 runs) for each tested variant.
Specifically, I compare the sequential baseline (second column) against the
parallelized variant running with 1, 2, 4, 8, or 16 worker threads (as
specified in [the constructor to the threadpool
object](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/test/mlas/unittest/test_main.cpp#L12)).
The numbers in parentheses are the speedup over the baseline.
| Input | baseline | 1 Thread | 2 Threads | 4 Threads | 8 Threads | 16 Threads |
| ------------- | ------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| 1:1:112:112 | 20.8 | 21.5 (x0.97) | 21.9 (x0.95) | 22.2 (x0.94) | 22.5 (x0.92) | 23.0 (x0.90) |
| 1:128:160:84 | 540.4 | 712.5 (x0.76) | 404.0 (x1.34) | 327.8 (x1.65) | 377.9 (x1.43) | 371.8 (x1.45) |
| 13:240:4:314 | 1484.0 | 1851.1 (x0.80) | 1080.9 (x1.37) | 570.2 (x2.60) | 531.8 (x2.79) | 511.2 (x2.90) |
| 13:96:4:314 | 471.0 | 679.9 (x0.69) | 427.2 (x1.10) | 372.1 (x1.27) | 445.5 (x1.06) | 428.5 (x1.10) |
| 1:64:320:168 | 1215.1 | 1497.8 (x0.81) | 863.8 (x1.41) | 456.7 (x2.66) | 435.7 (x2.79) | 462.5 (x2.63) |
| 30:240:4:140 | 1711.5 | 2181.4 (x0.78) | 1182.6 (x1.45) | 657.4 (x2.60) | 592.5 (x2.89) | 578.0 (x2.96) |
| 30:336:4:140 | 2432.5 | 3039.2 (x0.80) | 1695.6 (x1.43) | 920.7 (x2.64) | 817.1 (x2.98) | 819.2 (x2.97) |
The initial drop between the baseline and the single-worker-thread variant
can be attributed to the overhead of invoking the reordering loop as a
functor in TryBatchParallelFor. This overhead is amortized by the speedup
of parallel processing as the number of worker threads increases.