ggml-cpu: FA add GEMM microkernel (#19422)
* ggml-cpu: FA add GEMM microkernel
* add guard for sizeless vector types
* fix case where DV % GGML_F32_EPR !=0
* move memset out of the loop
* move another memset out of the loop
* use RM=4 for arm
* simd_gemm: convert everything to int
* convert everything to size_t to avoid warnings
* fixup
* add pragma for ignoring aggressive loop optimizations