[MLAS] Removed memcpy step by storing result in C if possible (#27367)
<h2 data-sourcepos="1:1-1:10" dir="auto">Summary</h2>
<p data-sourcepos="2:1-2:68" dir="auto">This change removes the memcpy
step in sgemm_kleidiai where possible by writing the result directly to the output matrix C.</p>
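
As a rough illustration of the idea, the sketch below contrasts a direct write into C with a fallback that goes through a temporary buffer. It is not the code from this change: `TileKernel` and `SgemmTile` are hypothetical names, the kernel is a naive reference loop, and the condition for taking the direct path (`alpha == 1` and `beta == 0`) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile kernel: computes dst = A * B for a
// row-major (M x K) by (K x N) product. Rows of dst are `dst_stride`
// floats apart, so the kernel can target a strided output matrix.
static void TileKernel(const float* A, const float* B, float* dst,
                       size_t M, size_t N, size_t K, size_t dst_stride) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            dst[m * dst_stride + n] = acc;
        }
    }
}

// Computes C = alpha * A * B + beta * C, where C has leading dimension ldc.
// Returns true when the result was written straight into C.
bool SgemmTile(const float* A, const float* B, float* C,
               size_t M, size_t N, size_t K, size_t ldc,
               float alpha, float beta) {
    assert(ldc >= N);

    // Assumed condition for this sketch: with no scaling and no accumulation,
    // the kernel output is already the final value of C.
    if (alpha == 1.0f && beta == 0.0f) {
        TileKernel(A, B, C, M, N, K, ldc);  // no temporary, no copy
        return true;
    }

    // Fallback: compute into a dense temporary, then blend into C.
    std::vector<float> tmp(M * N);
    TileKernel(A, B, tmp.data(), M, N, K, N);
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            const float scaled = alpha * tmp[m * N + n];
            float& c = C[m * ldc + n];
            // When beta is zero, the old contents of C are ignored entirely.
            c = (beta == 0.0f) ? scaled : scaled + beta * c;
        }
    }
    return false;
}
```

In this sketch the direct path passes `ldc` as the kernel's row stride, so only the `N` result columns of each row are written and any padding between rows of C is left untouched.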
<h2 data-sourcepos="4:1-4:10" dir="auto">
<a href="#testing" aria-hidden="true" class="anchor"
id="user-content-testing"></a>Testing</h2>
Model | Baseline avg (ms) | Current avg (ms) | Δ ms | Δ %
-- | -- | -- | -- | --
Transformer_complex_f32.onnx | 2.929885 | 2.701083 | -0.228802 | -7.81%
bert_tiny_f32.onnx | 0.279675 | 0.273928 | -0.005747 | -2.05%
de_efficientnetlitev3_f32.onnx | 80.038132 | 78.560747 | -1.477385 | -1.85%
deeplabv3_mobilenetv2_f32.onnx | 48.565125 | 46.446841 | -2.118284 | -4.36%
imagetransformnet_f32.onnx | 303.835868 | 302.553625 | -1.282243 | -0.42%
mobilenet_v1_f32.onnx | 4.379468 | 4.163018 | -0.216450 | -4.94%
mobilenetv1_ssd_f32.onnx | 9.245055 | 8.881198 | -0.363857 | -3.94%
openposev2_vgg19_f32.onnx | 210.981128 | 209.199398 | -1.781730 | -0.84%
retinaface_f32.onnx | 42.326391 | 38.454346 | -3.872045 | -9.15%
rfdn_f32.onnx | 13.929565 | 13.679875 | -0.249690 | -1.79%
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>