[webgpu] Use 64 as the workgroup size of DP4AMatMulQuantize (#24129)
A workgroup size of 1 is usually a poor choice for a compute shader: only
one thread is active in each workgroup, so the rest of the GPU's parallel
lanes sit idle. This PR uses 64 as the workgroup size of
DP4AMatMulQuantize.
On Qualcomm Adreno x1-85 GPU: 721.13 ms -> 148.38 ms (about 4.9x faster)
On NV RTX 2000 Ada: 87.66 ms -> 14.51 ms (about 6.0x faster)
On Intel Xe GPU: 76.30 ms -> 42.96 ms (about 1.8x faster)
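
To illustrate why the workgroup size matters for dispatch, here is a minimal
sketch of the usual ceiling-division calculation for the number of workgroups.
The helper name `workgroups_needed` and the item count of 4096 are
hypothetical and are not taken from the actual change; the real dispatch logic
in DP4AMatMulQuantize may differ.

```python
def workgroups_needed(total_invocations: int, workgroup_size: int) -> int:
    """Return how many workgroups cover `total_invocations` threads.

    Hypothetical helper for illustration only: plain ceiling division.
    """
    if total_invocations < 0:
        raise ValueError("total_invocations must be non-negative")
    if workgroup_size <= 0:
        raise ValueError("workgroup_size must be positive")
    return (total_invocations + workgroup_size - 1) // workgroup_size


if __name__ == "__main__":
    work_items = 4096  # assumed number of shader invocations, for illustration

    # Workgroup size 1: one workgroup per invocation, one active thread each.
    print(workgroups_needed(work_items, 1))

    # Workgroup size 64: each workgroup carries 64 threads.
    print(workgroups_needed(work_items, 64))
```

When the invocation count is not a multiple of the workgroup size, the last
workgroup has surplus threads, so a shader dispatched this way typically needs
a bounds check on its global invocation index.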