Perform QlinearConv for a batch in a single parallel (#14296)
### Description
This change lets the QlinearConv operator process a whole batch in a
single parallel section, so the tasks of every image in the batch are
available to the threads at once. It is an alternative to the existing
method, which parallelizes the tasks of each individual image
separately and forces threads to wait for all of one image's tasks to
complete before continuing.
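
The scheduling idea described above can be sketched roughly as follows. This is an illustration only, not the code in this patch: `RunBatchInSingleParallelSection` and its parameters are hypothetical names, and plain `std::thread` workers stand in for the runtime's own thread pool. All `batch_size * tasks_per_image` work items go into one flat index space, so a thread that finishes early picks up a task from any image instead of idling at a per-image barrier.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not an onnxruntime API): calls work(image, task) once
// for every (image, task) pair, drawing from one shared pool of work items.
void RunBatchInSingleParallelSection(
    std::size_t batch_size, std::size_t tasks_per_image, std::size_t num_threads,
    const std::function<void(std::size_t, std::size_t)>& work) {
  assert(tasks_per_image > 0 && num_threads > 0);

  // Flatten (image, task) into a single index range covering the whole batch.
  const std::size_t total = batch_size * tasks_per_image;
  std::atomic<std::size_t> next{0};

  auto worker = [&]() {
    for (;;) {
      // Claim the next unprocessed item; items from all images are eligible.
      const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
      if (i >= total) break;
      work(i / tasks_per_image, i % tasks_per_image);
    }
  };

  // One parallel section for the entire batch; the only join is at the end.
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (std::size_t t = 0; t < num_threads; ++t) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
}
```

In the per-image scheme, by contrast, the equivalent of the `join` would happen once per image, which is where threads end up waiting.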
### Motivation and Context
For int8 convolution models run with batched inputs, this patch
improves inference performance by up to 41% and 39% for
Mobilenet_edtpu (U8S8) and Resnet50 (U8S8), respectively, on systems
with higher core counts. The patch delivers the greatest benefit on
systems with higher thread counts and when using large batch sizes.

| Model | Metric | Batch 2 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| -- | -- | -- | -- | -- | -- | -- | -- |
| resnet50 | % Gain | 22% | 25% | 32% | 36% | 33% | 32% |