[ARM CPU] Enable FP16 kernels for GQA op (#23746)
### Description
- Enable hgemm and softmax fp16 kernels for GQA
- Add intra-loop parallelism to the RoPE fp16 kernel (see the sketch after this list)
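For context on the second bullet, here is a minimal sketch of what intra-loop parallelism over RoPE means: the rotary-pair loop for a single token/head vector is split across workers, instead of only parallelizing across tokens or heads, which helps when there are too few outer iterations (e.g. batch size 1 during token generation) to keep all cores busy. Everything below is illustrative only; the function names, the 10000 base, the interleaved pair layout, the use of `float`, and raw `std::thread` are assumptions for the sketch, whereas the actual kernel operates on fp16 data and schedules work through the ONNX Runtime thread pool.

```cpp
// Sketch only: not the MLAS/ORT implementation.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Rotate the interleaved pairs [pair_begin, pair_end) of one token's head vector.
static void RopeChunk(float* x, std::size_t pos, std::size_t rotary_dim,
                      std::size_t pair_begin, std::size_t pair_end) {
  for (std::size_t i = pair_begin; i < pair_end; ++i) {
    // theta_i = pos * base^(-2i/d), with base assumed to be 10000 here.
    const float theta = static_cast<float>(pos) *
                        std::pow(10000.0f, -2.0f * static_cast<float>(i) /
                                               static_cast<float>(rotary_dim));
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    const float x0 = x[2 * i];
    const float x1 = x[2 * i + 1];
    x[2 * i] = x0 * c - x1 * s;
    x[2 * i + 1] = x0 * s + x1 * c;
  }
}

// Intra-loop parallelism: split the inner rotary-pair loop of a single
// vector across threads, rather than only parallelizing the outer loop.
void RopeIntraLoopParallel(float* x, std::size_t pos, std::size_t rotary_dim,
                           std::size_t num_threads) {
  const std::size_t pairs = rotary_dim / 2;
  const std::size_t chunk = (pairs + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(pairs, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back(RopeChunk, x, pos, rotary_dim, begin, end);
  }
  for (auto& w : workers) w.join();
}
```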
__Benchmarking models__
- float32: [phi-3 cpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)
Note:
- Both fp32 and fp16 models share the same model structure and operator settings.
- GQA takes ~15% of the runtime.
- Prompt length 256, token generation length 512.
Benchmark machine: Linux (Ubuntu 24.04), Standard D16pls v5 (16 vCPUs, 32 GiB memory)
| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75 | 7.2 | 7.95 | +10.39% | +67.43% |
### Motivation and Context
Speed up the GQA op with FP16 kernels on ARM CPUs.