[CUDA] MatMulNBits benchmark (#24564)
### Description
1. Add a benchmark script for MatMulNBits.
2. Update the kernel based on the benchmark results:
   - Change the kernel back to handle m=1.
   - Use a simple loop kernel instead of an unrolled one.
   - Accumulate partial sums in float to balance precision and
     performance (less precision loss, no obvious performance drop).
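
To illustrate why accumulating partial sums in float loses less precision, here is a minimal Python sketch. It is not the CUDA kernel: it simulates half and single precision by rounding through `struct`, and the helper names (`to_half`, `to_single`, `dot`) and the input values are hypothetical choices made for this example only.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product whose accumulator is rounded by round_acc after every add.

    The product x * y is formed in Python's double precision; only the
    running sum is narrowed, which is the part this sketch is about.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + x * y)
    return acc


K = 8192  # reduction length, matching the largest K in the benchmark
a = [to_half(0.1)] * K  # half-precision inputs (illustrative values)
b = [to_half(0.1)] * K

exact = sum(x * y for x, y in zip(a, b))  # double-precision reference
half_result = dot(a, b, to_half)          # accumulator kept in half
float_result = dot(a, b, to_single)       # accumulator kept in float

print(f"reference: {exact:.4f}")
print(f"half accumulator:  {half_result:.4f}")
print(f"float accumulator: {float_result:.4f}")
```

With a long reduction of small terms, the half accumulator stops growing once each addend falls below half its spacing, while the float accumulator stays close to the reference.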
Example output of benchmark:
```
------------------------------------------------------------------------------------------------------------------------
Benchmarking MatMulNBits on NVIDIA A100-SXM4-80GB (Compute Capability: 8.0)
------------------------------------------------------------------------------------------------------------------------
CUDA Graph | M | N | K | Bits | Block Size | Threads | Latency (us) | StdDev (us) | TFLOPS
------------------------------------------------------------------------------------------------------------------------
True | 1 | 3072 | 8192 | 4 | 32 | 0 | 95.7 | 5.7 | 0.526
True | 1 | 3072 | 8192 | 8 | 32 | 0 | 110.7 | 81.1 | 0.454
True | 1 | 3072 | 8192 | 4 | 128 | 0 | 93.7 | 41.2 | 0.537
True | 1 | 3072 | 8192 | 8 | 128 | 0 | 105.0 | 129.3 | 0.479
True | 1 | 5120 | 3072 | 4 | 32 | 0 | 86.7 | 49.9 | 0.363
True | 1 | 5120 | 3072 | 8 | 32 | 0 | 90.1 | 41.1 | 0.349
True | 1 | 5120 | 3072 | 4 | 128 | 0 | 83.9 | 46.7 | 0.375
True | 1 | 5120 | 3072 | 8 | 128 | 0 | 85.2 | 57.1 | 0.369
True | 1 | 8192 | 3072 | 4 | 32 | 0 | 107.3 | 29.2 | 0.469
True | 1 | 8192 | 3072 | 8 | 32 | 0 | 102.3 | 57.1 | 0.492
True | 1 | 8192 | 3072 | 4 | 128 | 0 | 99.2 | 61.2 | 0.507
True | 1 | 8192 | 3072 | 8 | 128 | 0 | 97.5 | 47.4 | 0.516
True | 1 | 200064 | 3072 | 4 | 32 | 0 | 1456.4 | 11.0 | 0.844
True | 1 | 200064 | 3072 | 8 | 32 | 0 | 1336.4 | 10.3 | 0.920
True | 1 | 200064 | 3072 | 4 | 128 | 0 | 1261.6 | 16.6 | 0.974
True | 1 | 200064 | 3072 | 8 | 128 | 0 | 1232.6 | 17.9 | 0.997
True | 256 | 3072 | 8192 | 4 | 32 | 0 | 211.1 | 5.8 | 61.030
True | 256 | 3072 | 8192 | 8 | 32 | 0 | 217.8 | 62.8 | 59.154
True | 256 | 3072 | 8192 | 4 | 128 | 0 | 208.7 | 63.3 | 61.751
True | 256 | 3072 | 8192 | 8 | 128 | 0 | 213.0 | 58.2 | 60.491
True | 256 | 5120 | 3072 | 4 | 32 | 0 | 151.9 | 57.4 | 53.028
True | 256 | 5120 | 3072 | 8 | 32 | 0 | 156.2 | 71.1 | 51.554
True | 256 | 5120 | 3072 | 4 | 128 | 0 | 151.4 | 22.6 | 53.198
True | 256 | 5120 | 3072 | 8 | 128 | 0 | 154.6 | 47.1 | 52.092
True | 256 | 8192 | 3072 | 4 | 32 | 0 | 219.0 | 4.4 | 58.847
True | 256 | 8192 | 3072 | 8 | 32 | 0 | 226.6 | 14.5 | 56.860
True | 256 | 8192 | 3072 | 4 | 128 | 0 | 206.7 | 39.9 | 62.333
True | 256 | 8192 | 3072 | 8 | 128 | 0 | 216.2 | 41.3 | 59.587
True | 256 | 200064 | 3072 | 4 | 32 | 0 | 3110.9 | 11.3 | 101.152
True | 256 | 200064 | 3072 | 8 | 32 | 0 | 3290.9 | 8.3 | 95.619
True | 256 | 200064 | 3072 | 4 | 128 | 0 | 3055.2 | 10.2 | 102.995
True | 256 | 200064 | 3072 | 8 | 128 | 0 | 3220.4 | 9.8 | 97.712
True | 1024 | 3072 | 8192 | 4 | 32 | 0 | 363.6 | 40.2 | 141.754
True | 1024 | 3072 | 8192 | 8 | 32 | 0 | 369.0 | 46.0 | 139.669
True | 1024 | 3072 | 8192 | 4 | 128 | 0 | 362.8 | 55.6 | 142.052
True | 1024 | 3072 | 8192 | 8 | 128 | 0 | 367.5 | 56.5 | 140.256
True | 1024 | 5120 | 3072 | 4 | 32 | 0 | 221.6 | 58.1 | 145.383
True | 1024 | 5120 | 3072 | 8 | 32 | 0 | 225.4 | 56.6 | 142.938
True | 1024 | 5120 | 3072 | 4 | 128 | 0 | 220.2 | 36.9 | 146.306
True | 1024 | 5120 | 3072 | 8 | 128 | 0 | 224.1 | 57.8 | 143.751
True | 1024 | 8192 | 3072 | 4 | 32 | 0 | 346.2 | 41.8 | 148.854
True | 1024 | 8192 | 3072 | 8 | 32 | 0 | 352.8 | 21.6 | 146.097
True | 1024 | 8192 | 3072 | 4 | 128 | 0 | 344.5 | 18.9 | 149.627
True | 1024 | 8192 | 3072 | 8 | 128 | 0 | 350.6 | 10.6 | 147.016
True | 1024 | 200064 | 3072 | 4 | 32 | 0 | 6822.0 | 44.1 | 184.504
True | 1024 | 200064 | 3072 | 8 | 32 | 0 | 7018.5 | 38.4 | 179.339
True | 1024 | 200064 | 3072 | 4 | 128 | 0 | 6757.8 | 51.5 | 186.257
True | 1024 | 200064 | 3072 | 8 | 128 | 0 | 6947.7 | 38.1 | 181.167
------------------------------------------------------------------------------------------------------------------------
```
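
The TFLOPS column is consistent with counting `2 * M * N * K` floating-point operations per call and dividing by the measured latency. The sketch below assumes that formula; the benchmark script may compute it differently, and `matmul_tflops` is a hypothetical helper name.

```python
def matmul_tflops(m: int, n: int, k: int, latency_us: float) -> float:
    """Estimate throughput in TFLOPS for an (m x k) by (k x n) matmul.

    Assumes one multiply and one add per inner-product term,
    i.e. 2 * m * n * k floating-point operations in total.
    """
    flops = 2 * m * n * k
    seconds = latency_us * 1e-6
    return flops / seconds / 1e12


# Rows taken from the table above: (M, N, K, latency in us).
for m, n, k, latency_us in [
    (1, 3072, 8192, 95.7),
    (1024, 200064, 3072, 6822.0),
]:
    print(f"M={m} N={n} K={k}: {matmul_tflops(m, n, k, latency_us):.3f} TFLOPS")
```

Small differences from the table in the last digit are expected because the latencies shown are rounded to one decimal place.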
### Motivation and Context
Follow-up to https://github.com/microsoft/onnxruntime/pull/24509