[WebGPU-EP] Optimize subgroup_matrix_matmul_nbits on Intel (#25140)
This PR optimizes the Intel path of subgroup_matrix_matmul_nbits. Instead
of having each thread copy matrix A into workgroup memory, the shader now
calls subgroupMatrixLoad directly on global memory, which reduces shared
local memory (SLM) usage and bandwidth pressure.
- Removed var<workgroup> tile_A and the loadSHMA helper function.
- Updated the inner loop to compute a global offset into input_a and call
subgroupMatrixLoad on it directly.
- Adjusted the indexing and stride parameters to match the layout of
input_a in global memory.
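
For reviewers, the indexing change above can be sketched on the CPU. This is only an illustrative Python model, not the WGSL from this PR: it assumes input_a is a row-major M x K buffer and an 8x8 tile, and the helper names (load_tile, global_tile_offset) and the sizes are hypothetical. It shows that loading a tile straight from the global buffer with stride K yields the same values as first staging the tile in a packed workgroup-style buffer and loading with the tile width as stride.

```python
# Illustrative CPU model of a strided tile load. All names and sizes here are
# assumptions made for this sketch; they are not taken from the shader source.

TILE = 8  # assumed tile edge length


def load_tile(buffer, offset, stride, rows=TILE, cols=TILE):
    """Read a row-major tile: element (r, c) is buffer[offset + r * stride + c]."""
    return [[buffer[offset + r * stride + c] for c in range(cols)]
            for r in range(rows)]


def global_tile_offset(tile_row, k_start, k_dim):
    """Offset of the tile's first element in a row-major M x K buffer."""
    return tile_row * k_dim + k_start


M, K = 16, 32
input_a = list(range(M * K))  # stand-in for matrix A in global memory

# New path: load directly from the global buffer, stride = K.
offset = global_tile_offset(tile_row=8, k_start=16, k_dim=K)
direct = load_tile(input_a, offset, stride=K)

# Old path: copy the tile into a packed staging buffer first (the role that
# tile_A played), then load from it with stride = TILE.
staged_buf = [input_a[offset + r * K + c]
              for r in range(TILE) for c in range(TILE)]
staged = load_tile(staged_buf, 0, stride=TILE)

assert direct == staged
```

Under these assumptions, the only differences between the two paths are the base offset and the row stride, which is why the staging buffer and its copy step can be dropped.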