[MLAS] Simplify & optimize Arm64 NCHWc Convolution kernels (#26691)
### Description
This PR makes the following changes:
1. Reroutes pointwise (1x1) convolution through the GEMM path, since a
pointwise convolution is mathematically a matrix multiply. This unlocks
the performance benefits of the GEMM kernels, as shown in the
performance section below.
2. Eliminates some branches by using built-in MLAS helpers such as
`MlasBlendFloat32x4`.
3. Expands the unit test coverage of the NCHWc Conv kernels to catch
edge cases.
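The pointwise-to-GEMM equivalence behind item 1 can be sketched as follows. This is a minimal NumPy illustration, not MLAS code; all shapes and names here are hypothetical:

```python
import numpy as np

# A 1x1 (pointwise) convolution over an NCHW tensor is just a matrix
# multiply: every spatial position is an independent dot product over
# the input channels.
N, Cin, H, W, Cout = 1, 8, 5, 5, 16
x = np.random.rand(N, Cin, H, W).astype(np.float32)
w = np.random.rand(Cout, Cin).astype(np.float32)  # 1x1 kernel, squeezed

# Direct pointwise convolution:
#   direct[n, o, h, w] = sum_c w[o, c] * x[n, c, h, w]
direct = np.einsum('oc,nchw->nohw', w, x)

# Same result via GEMM: flatten the spatial dims to (Cin, H*W),
# multiply by the (Cout, Cin) weight matrix, and restore the shape.
gemm = (w @ x.reshape(N, Cin, H * W)).reshape(N, Cout, H, W)

assert np.allclose(direct, gemm, atol=1e-5)
```

Because the two computations are identical, routing the pointwise case to the GEMM kernels reuses their tiling and vectorization rather than maintaining a separate convolution path.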
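The branch elimination in item 2 relies on mask-based per-lane selection. The following is a conceptual NumPy sketch of that idea, not the actual MLAS kernel code; the function names are invented for illustration, and `np.where` stands in for a SIMD blend such as `MlasBlendFloat32x4`:

```python
import numpy as np

# Branchy handling of a partial vector tail: a per-lane `if`
# (the kind of pattern the PR removes).
def tail_branchy(acc, update, valid_lanes):
    out = acc.copy()
    for i in range(len(acc)):
        if i < valid_lanes:
            out[i] = update[i]
    return out

# Branchless equivalent: build a lane mask once and blend, the way a
# SIMD blend selects per-lane values without any branch.
def tail_blend(acc, update, valid_lanes):
    mask = np.arange(len(acc)) < valid_lanes
    return np.where(mask, update, acc)

acc = np.zeros(4, dtype=np.float32)
upd = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
assert np.array_equal(tail_branchy(acc, upd, 3), tail_blend(acc, upd, 3))
```

On Arm64 the blend compiles to a single bitwise-select instruction, so the tail handling becomes data-flow rather than control-flow.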
### Performance
This speeds up any Conv model that hits the pointwise kernel.
For example, MobileNet inference throughput improves from 500 inf/sec
to 550 inf/sec (a 10% gain).
### Testing
- Build passed: `./build.sh --config=Release --build_shared_lib
--parallel --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON`
- Unit tests passed: `./build/Linux/Release/onnxruntime_mlas_test
--gtest_filter=Conv2dNchwc_*`
- Perf measured with: `./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times
-r 2000 ~/scripts/mobilenet.onnx`
Happy to run additional perf tests as required.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>