[CUDA] fp16 intB gemm (#24854)
### Description
* Add the fpA-intB gemm kernel from TensorRT-LLM's
WeightOnlyGroupwiseQuantGemmPlugin.
* Add prepacking that converts weights/scales/zero_points so MatMulNBits
can use the kernel (see the sketch below).
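
For context, the math the fused kernel computes is standard weight-only groupwise dequantization followed by an fp16 GEMM. The NumPy sketch below is illustrative only: it uses unpacked 4-bit values and hypothetical names, not the actual prepacked layout produced by this PR.

```python
import numpy as np

def dequant_groupwise(q_weight, scales, zero_points, group_size):
    # q_weight:    (K, N) int4 values stored unpacked as uint8 (0..15)
    # scales:      (K // group_size, N) fp16 per-group scales
    # zero_points: (K // group_size, N) per-group zero points (0..15)
    K, _ = q_weight.shape
    g = np.arange(K) // group_size  # group index for each row along K
    w = q_weight.astype(np.float16) - zero_points[g].astype(np.float16)
    return w * scales[g]

def fpA_intB_gemm_reference(a_fp16, q_weight, scales, zero_points, group_size):
    # fp16 activations times dequantized int4 weights; the fused CUDA
    # kernel computes this without materializing the fp16 weight matrix.
    return a_fp16 @ dequant_groupwise(q_weight, scales, zero_points, group_size)
```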
Limitations:
* Only the fp16 kernel is enabled. BF16 support will be added later.
* Zero points are required. Scales-only support may be added later.
* Bias is not enabled since the previous MatMulNBits kernel does not
support bias.
### Motivation and Context
To improve LLM inference performance.
Initial results show 2.2x throughput on prompt processing and 1.25x
throughput on token generation with the onnxruntime-genai
benchmark_e2e.py benchmark running phi-4-mini-instruct on an A100.