[MLAS] Add kleidiai pad ptr invalidation test case (#27465)
### Description
This PR makes the following minor code changes:
- Address the Copilot header-include suggestions from
https://github.com/microsoft/onnxruntime/pull/27439
- Add a test case that covers the code path fixed in
https://github.com/microsoft/onnxruntime/pull/27215 and the test case
discussed in https://github.com/microsoft/onnxruntime/pull/27214
- Narrow the pointer invalidation so that it covers only the updated
pointer in the pad structure
### Testing
This patch was tested in two ways.
1) After writing tests that I expected to trigger the earlier failure
case, I reverted convolve_kleidiai.cpp to its state before the initial
pad ptr fix in [Hari's change](https://github.com/microsoft/onnxruntime/pull/27215)
was introduced. I then added debug logging and ran the tests to confirm
that they fail and to highlight the pointer moving and going stale. An
example failure is shown below.
2) I restored the current code and then ran the tests repeatedly: <br>
`for i in $(seq 1 2000); do echo "ITER=$i"; ./onnxruntime_mlas_test
--long --gtest_filter='*Conv2d*' || break; done`
### Explanation of the logs below
1) **Padding buffer relocation**
- `KLEIDIAI_CONV_LHS pad_buf MOVED ci=320 padsize=512 old=0x12e80d800
new=0x12e81ac00`
- Meaning: the internal zero padding buffer used for out-of-bounds
pixels was resized and the underlying storage address changed (`old` →
`new`). Any previously-built indirection table entries that pointed at
the old padding buffer are now stale.
2) **Reuse of cached indirection table after the move**
- `KLEIDIAI_CONV_LHS indirection_cache HIT ci=64 m=121 pad=0x12e81ac00
old_pad=0x12e80d800 (after_pad_move)`
- Meaning: for a later convolution (`ci=64`) the indirection-table cache
returned a HIT. The log prints the current pad buffer address
(`pad=...`) and the most recent prior padding-buffer address
(`old_pad=...`) captured during the move. The `(after_pad_move)` tag
indicates that this cache HIT occurred after a pad-buffer relocation
event, which is the dangerous case in the pre-fix implementation (cached
tables may still contain pointers to `old_pad`).
In failing runs, the output mismatch occurs immediately after this
sequence, showing a clear correlation: **pad buffer moved → cached
indirection table reused → incorrect results**.
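The sequence above can be reproduced in miniature. The following is a minimal sketch, not the MLAS implementation: `ToyConvLhs`, `EnsurePad`, `Table`, and `ReplayIsConsistent` are hypothetical names, every table entry is assumed to point at the pad buffer, and dropping only the tables that reference the retired buffer is just one plausible invalidation policy. Addresses are stored as integers so that comparing against freed storage stays well-defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

using Addr = std::uintptr_t;  // addresses kept as integers for safe comparison

// Hypothetical stand-in for the LHS pad buffer plus indirection-table cache.
struct ToyConvLhs {
    std::vector<float> pad;                          // zero padding buffer
    std::map<std::size_t, std::vector<Addr>> cache;  // ci -> indirection table
    bool invalidate_on_move;                         // false models the pre-fix code

    explicit ToyConvLhs(bool invalidate) : invalidate_on_move(invalidate) {}

    Addr PadAddr() const { return reinterpret_cast<Addr>(pad.data()); }

    // Grow the pad buffer when ci exceeds its size. The new storage is
    // allocated while the old one is still alive, so the address must change.
    void EnsurePad(std::size_t ci) {
        if (pad.size() >= ci) return;
        const Addr old_addr = PadAddr();
        std::vector<float> grown(ci, 0.0f);
        pad.swap(grown);  // the old storage is freed when `grown` goes out of scope
        if (!invalidate_on_move) return;
        // Assumed policy: drop only the tables that reference the retired buffer.
        for (auto it = cache.begin(); it != cache.end();) {
            const bool stale = std::find(it->second.begin(), it->second.end(),
                                         old_addr) != it->second.end();
            it = stale ? cache.erase(it) : std::next(it);
        }
    }

    // Return the table for ci, building it on a cache MISS.
    const std::vector<Addr>& Table(std::size_t ci) {
        assert(ci > 0);
        EnsurePad(ci);
        auto it = cache.find(ci);
        if (it == cache.end()) {
            // Simplification: four taps, all out of bounds, all pointing at pad.
            it = cache.emplace(ci, std::vector<Addr>(4, PadAddr())).first;
        }
        return it->second;
    }
};

// Replay ci=64, ci=320, ci=64 and report whether the final table only
// references the current pad buffer.
bool ReplayIsConsistent(bool invalidate_on_move) {
    ToyConvLhs lhs(invalidate_on_move);
    lhs.Table(64);                  // MISS, table built against the first pad buffer
    lhs.Table(320);                 // pad buffer grows and moves
    const auto& t = lhs.Table(64);  // HIT without invalidation, MISS with it
    const Addr current = lhs.PadAddr();
    return std::all_of(t.begin(), t.end(),
                       [current](Addr a) { return a == current; });
}
```

In this toy model `ReplayIsConsistent(false)` returns `false`, because the cached `ci=64` table still holds the old pad address after the move, while `ReplayIsConsistent(true)` returns `true`, because the stale table is dropped and rebuilt.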
* Note on the test: to keep the 2000 runs within a reasonable time, I
commented out most of the rest of the fixture in the changed file before
running them.
```
jonclo01$ ./onnxruntime_mlas_test --long --gtest_filter='*Conv2d*' clear
-------------------------------------------------------
----Total 3066 tests registered programmably!
-------------------------------------------------------
Note: Google Test filter = *Conv2d*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from Conv2d_SingleThread
[ RUN ] Conv2d_SingleThread.LongExecute
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 496 : KLEIDIAI_CONV_LHS pad_buf ci=64 padsize=256 addr=0x12e80d800
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 543 : KLEIDIAI_CONV_LHS indirection_cache MISS ci=64 m=121 pad=0x12e80d800
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 376 : kai_run_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme N=32 k_chunk_count=9 k_chunk_length=64 rhs_stride_row=128
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=576 nr=32 kr=1 sr=1 rhs_stride_row=484
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 490 : KLEIDIAI_CONV_LHS **pad_buf MOVED ci=320 padsize=512 old=0x12e80d800 new=0x12e81ac00**
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 543 : KLEIDIAI_CONV_LHS indirection_cache MISS ci=320 m=121 pad=0x12e81ac00
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=320
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 376 : kai_run_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme N=32 k_chunk_count=9 k_chunk_length=320 rhs_stride_row=128
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=320
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=2880 nr=32 kr=1 sr=1 rhs_stride_row=484
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 535 : KLEIDIAI_CONV_LHS indirection_cache HIT ci=64 m=121 **pad=0x12e81ac00 old_pad=0x12e80d800 (after_pad_move)**
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=576 nr=32 kr=1 sr=1 rhs_stride_row=484
/Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/test/mlas/unittest/test_conv2d.h:249: Failure
Expected equality of these values:
memcmp(Output, OutputReference, OutputElements * sizeof(float))
Which is: 90
0
B1/G1/Cpg64/Fpg32/H11/W11/KH3/KW3/Pad1,1,1,1/Dilation1,1/Stride1,1
Stack trace:
0x10247ba34: MlasConv2DTest<>::ExecuteLong()
0x102651904: testing::internal::HandleExceptionsInMethodIfSupported<>()
0x1026517a4: testing::Test::Run()
0x102652b5c: testing::TestInfo::Run()
0x102653c84: testing::TestSuite::Run()
... Google Test internal frames ...
[ FAILED ] Conv2d_SingleThread.LongExecute, where GetParam() = LongExecute (10 ms)
```
---------
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>