QLinearConv threading adjustments (#11228)
* Reserve the first core for the main thread
Currently in "auto affinity" mode the worker threads are affinized to cores 0..(N-1), leaving the very last core for the main thread. This patch preserves core #0 for the main thread, and affinizes the worker threads to cores 1..N.
* Avoid unneeded spin_pause in thread pool's worker threads
Remove unneeded PAUSE instruction (0.1-0.2 usec latency) after a worker thread finds a task to execute.
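
As a rough illustration of the spin-loop change (not the actual thread pool code), the sketch below pauses only when a poll finds no work and returns immediately once a task is found. `SpinForTask`, `try_pop`, `pause` and `spin_count` are hypothetical names; `pause` stands in for the CPU PAUSE hint.

```cpp
#include <functional>
#include <optional>

using Task = std::function<void()>;

// Hypothetical worker spin loop, assuming a non-blocking try_pop().
// try_pop: returns a task if one is queued, std::nullopt otherwise.
// pause:   stand-in for the CPU spin hint (PAUSE on x86).
template <typename TryPop, typename Pause>
std::optional<Task> SpinForTask(TryPop&& try_pop, Pause&& pause, int spin_count) {
    for (int i = 0; i < spin_count; ++i) {
        if (std::optional<Task> task = try_pop()) {
            // Work found: return right away, without a trailing pause.
            return task;
        }
        // Nothing queued yet: back off briefly before polling again.
        pause();
    }
    return std::nullopt;  // Spin budget exhausted; the caller may block.
}
```

With this shape, a task found on the third poll costs two pauses rather than three.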
* MLAS/x86: optimize QLinearConv on hybrid CPUs
Existing 4x task granularity for task partitioning on hybrid CPUs is not sufficient to compensate for the difference in VNNI instruction throughput between performance and efficiency cores. This patch:
* Increases granularity for QLinearConv by 2x, to have 2x more tasks with 2x smaller output count
* Limits the QLinearConv task count from above, to avoid the output count per task getting smaller than the kernel's capability
* Removes the hardcoded task count for QLinearConv, as it limited scaling on CPUs with 16+ cores
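
The partitioning rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MLAS implementation: `PartitionTaskCount`, `granularity` and `kernel_min_output` are hypothetical names, and the smallest output count the kernel handles efficiently is passed in as a parameter because the message does not state it.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: choose how many tasks to split a QLinearConv job into.
// thread_count:      threads available in the pool
// output_count:      total number of outputs to compute
// granularity:       tasks per thread (assumed 4 before this patch, 8 after)
// kernel_min_output: assumed smallest per-task output count worth scheduling
inline std::size_t PartitionTaskCount(std::size_t thread_count,
                                      std::size_t output_count,
                                      std::size_t granularity,
                                      std::size_t kernel_min_output) {
    thread_count = std::max<std::size_t>(thread_count, 1);
    granularity = std::max<std::size_t>(granularity, 1);
    kernel_min_output = std::max<std::size_t>(kernel_min_output, 1);

    // More, smaller tasks let faster cores pick up extra work.
    const std::size_t wanted = thread_count * granularity;

    // Upper limit: no task should get fewer than kernel_min_output outputs.
    const std::size_t max_tasks =
        std::max<std::size_t>(output_count / kernel_min_output, 1);

    // Scales with the thread count instead of using a hardcoded constant.
    return std::min(wanted, max_tasks);
}
```

For example, with 8 threads, a granularity of 8 and an assumed kernel minimum of 16 outputs, 4096 outputs are split into 64 tasks, while 256 outputs are capped at 16 tasks.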
* Addressing comments
* combining x86 ARM branches in qlinearconv threaded job partition
* revert first core assignment
Co-authored-by: Saurabh <saurabh.tangri@intel.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>