[webgpu] Implement SubGroupMatrix based MatMulNBits for Metal (#23729)
### Description
Recent progress on the SubGroupMatrix prototype in Dawn
(https://issues.chromium.org/issues/348702031) exposes Metal's SIMD-group
matrix functions to WebGPU. This change adds a MatMulNBits shader built on
that primitive.
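For readers unfamiliar with the operator, the following is a minimal scalar sketch of what a block-quantized 4-bit matmul computes. It is not the shader added in this PR. The function names, the low-nibble-first packing, the default zero point of 8, and the row-per-output-column layout are all assumptions made for illustration and may not match the actual operator's layout.

```python
# Illustrative reference only: names and data layout are hypothetical.

def dequantize_4bit(packed, scales, block_size, zero_point=8):
    """Unpack two 4-bit values per byte and rescale them block by block.

    Assumes the low nibble holds the earlier element (an assumption).
    """
    values = []
    for byte in packed:
        values.append(byte & 0x0F)  # low nibble
        values.append(byte >> 4)    # high nibble
    # Each run of `block_size` consecutive elements shares one scale.
    return [(q - zero_point) * scales[i // block_size]
            for i, q in enumerate(values)]


def matmul_nbits(a, packed_b, scales_b, block_size):
    """Compute A @ dequant(B)^T.

    a:        M rows of K floats
    packed_b: N rows of K/2 bytes (one row per output column)
    scales_b: N rows of K/block_size floats
    Returns an M x N list of lists.
    """
    b_rows = [dequantize_4bit(p, s, block_size)
              for p, s in zip(packed_b, scales_b)]
    return [[sum(x * w for x, w in zip(row, col)) for col in b_rows]
            for row in a]
```

A subgroup-matrix shader performs the same arithmetic, but on small tiles held across a SIMD group rather than one scalar multiply at a time.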
Observed perf gains, in terms of LLM inference speed: prompt processing
(prefill) for Phi 3.5 with a 1K-token prompt is about 2.7x faster, 5.4s
versus 14.6s for the baseline. Token generation speed is essentially unchanged.
With Changes
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 5.42498e+06 <<< SubGroupMatrix 5.4s
avg (tokens/s): 184.517
p50 (us): 5.41982e+06
stddev (us): 12023.8
n: 5 * 1001 token(s)
Token generation:
avg (us): 91138.5
avg (tokens/s): 10.9723
p50 (us): 89488.5
stddev (us): 35136.2
n: 635 * 1 token(s)
```
Baseline
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 1.45507e+07 <<< Baseline 14.5s
avg (tokens/s): 68.7938
p50 (us): 1.45413e+07
stddev (us): 22208.9
n: 5 * 1001 token(s)
Token generation:
avg (us): 94109.8
avg (tokens/s): 10.6259
p50 (us): 89660
stddev (us): 61579
n: 635 * 1 token(s)
```
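The headline numbers can be cross-checked from the two runs above. This small sketch (the helper name `summarize` is made up for illustration) converts the average prefill times to seconds and derives throughput and speedup from the logged values.

```python
# Cross-check of the benchmark output; inputs are copied from the logs.

def summarize(prefill_us, prompt_tokens):
    """Return (seconds, tokens per second) for one prefill measurement."""
    seconds = prefill_us / 1e6
    return seconds, prompt_tokens / seconds


PROMPT_TOKENS = 1001

with_changes_s, with_changes_tps = summarize(5.42498e6, PROMPT_TOKENS)
baseline_s, baseline_tps = summarize(1.45507e7, PROMPT_TOKENS)

# Speedup is the ratio of baseline time to the time with this change.
speedup = baseline_s / with_changes_s

print(f"with changes: {with_changes_s:.2f}s, {with_changes_tps:.1f} tokens/s")
print(f"baseline:     {baseline_s:.2f}s, {baseline_tps:.1f} tokens/s")
print(f"prefill speedup: {speedup:.2f}x")
```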