onnxruntime
10ab2529 - Enable parallel output reordering in MlasReorderOutputNchw() (#13643)

Commit
3 years ago
### Description

This PR speeds up the output reordering operation (as implemented in [MlasReorderOutputNchw](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/core/mlas/lib/reorder.cpp#L400)) by replacing the sequential implementation with a parallelized one. The parallelization is achieved through the existing [TryBatchParallelFor](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/include/onnxruntime/core/platform/threadpool.h#L284) construct.

### Motivation and Context

The output reordering operation is executed frequently in image processing models. Its implementation is easy to parallelize and can therefore be sped up on a multi-core machine. The speedup achieved by this PR varies with the actual input.

The table below summarizes the results of some of the experiments I conducted on a 16-core VM running on an AMD EPYC 7742 64-core processor. The experiment is based on the existing [unit test](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/mlas/unittest/test_reorder_output.cpp) for the output reordering operation. The first column gives the shape of the output as BatchCount:Channels:Height:Width, and the numbers in the other columns give the latency (in microseconds, averaged over 100 runs) for the tested variants. Specifically, I compare the sequential baseline (second column) with the parallelized variants, each using 1, 2, 4, 8 or 16 worker threads (as specified in [the constructor to the threadpool object](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/test/mlas/unittest/test_main.cpp#L12)).
The numbers in parentheses represent the speedup over the baseline.

| Input | baseline | 1 Thread | 2 Threads | 4 Threads | 8 Threads | 16 Threads |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 1:1:112:112 | 20.8 | 21.5 (x0.97) | 21.9 (x0.95) | 22.2 (x0.94) | 22.5 (x0.92) | 23.0 (x0.90) |
| 1:128:160:84 | 540.4 | 712.5 (x0.76) | 404.0 (x1.34) | 327.8 (x1.65) | 377.9 (x1.43) | 371.8 (x1.45) |
| 13:240:4:314 | 1484.0 | 1851.1 (x0.80) | 1080.9 (x1.37) | 570.2 (x2.60) | 531.8 (x2.79) | 511.2 (x2.90) |
| 13:96:4:314 | 471.0 | 679.9 (x0.69) | 427.2 (x1.10) | 372.1 (x1.27) | 445.5 (x1.06) | 428.5 (x1.10) |
| 1:64:320:168 | 1215.1 | 1497.8 (x0.81) | 863.8 (x1.41) | 456.7 (x2.66) | 435.7 (x2.79) | 462.5 (x2.63) |
| 30:240:4:140 | 1711.5 | 2181.4 (x0.78) | 1182.6 (x1.45) | 657.4 (x2.60) | 592.5 (x2.89) | 578.0 (x2.96) |
| 30:336:4:140 | 2432.5 | 3039.2 (x0.80) | 1695.6 (x1.43) | 920.7 (x2.64) | 817.1 (x2.98) | 819.2 (x2.97) |

The initial slowdown between the baseline and the variant using just one worker thread can be attributed to the overhead of invoking the reordering loop as a functor in TryBatchParallelFor. This overhead is compensated by the speedup of parallel processing when the number of worker threads is increased.
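
For illustration only, here is a minimal sketch of the general idea: the reordering loop is wrapped in a functor and contiguous ranges of output channel planes are handed to worker threads. This is not the code from this PR. `ParallelForBatches` is a hypothetical stand-in built on `std::thread` rather than the real `TryBatchParallelFor`, and the block size of 8 and the blocked source layout `[N][C/8][H*W][8]` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Assumed channel block size of the blocked (NCHWc-style) source layout.
constexpr std::size_t kBlockSize = 8;

// Hypothetical helper (not an onnxruntime API): splits [0, total) into at most
// `num_batches` contiguous ranges and runs fn(begin, end) for each range on
// its own thread, then waits for all of them.
void ParallelForBatches(std::size_t total, std::size_t num_batches,
                        const std::function<void(std::size_t, std::size_t)>& fn) {
    assert(num_batches > 0);
    num_batches = std::min(num_batches, std::max<std::size_t>(total, 1));
    const std::size_t chunk = (total + num_batches - 1) / num_batches;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        workers.emplace_back([&fn, begin, end] { fn(begin, end); });
    }
    for (auto& worker : workers) worker.join();
}

// Copies output planes [begin, end) from the blocked layout to plain NCHW.
// A "plane" is one (batch, channel) pair holding `spatial` = H*W elements.
void ReorderPlanes(const float* src, float* dst, std::size_t channels,
                   std::size_t spatial, std::size_t begin, std::size_t end) {
    // Channels are padded up to a multiple of the block size in the source.
    const std::size_t blocked_channels =
        (channels + kBlockSize - 1) / kBlockSize * kBlockSize;
    for (std::size_t plane = begin; plane < end; ++plane) {
        const std::size_t n = plane / channels;
        const std::size_t c = plane % channels;
        const float* s = src + n * blocked_channels * spatial +
                         (c / kBlockSize) * spatial * kBlockSize + (c % kBlockSize);
        float* d = dst + plane * spatial;
        for (std::size_t i = 0; i < spatial; ++i) d[i] = s[i * kBlockSize];
    }
}

// Parallel reorder: each worker writes a disjoint range of output planes,
// so no synchronization is needed beyond the final join.
void ReorderOutputNchwParallel(const float* src, float* dst, std::size_t batch,
                               std::size_t channels, std::size_t spatial,
                               std::size_t num_threads) {
    ParallelForBatches(batch * channels, num_threads,
                       [=](std::size_t begin, std::size_t end) {
                           ReorderPlanes(src, dst, channels, spatial, begin, end);
                       });
}
```

With `num_threads` set to 1 the work still goes through the functor and a worker thread, which mirrors the kind of fixed overhead described above for the one-thread variant; with very small outputs such as 1:1:112:112 there is only one plane to distribute, so extra threads cannot help in this sketch.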