[webgpu] Optimize dp4 prefill shader for Qualcomm (#25578)
For subgroup size 64 (sg_size=64), this change uses subgroupShuffle to
share operands between invocations when performing the matmul. It also
replaces the unrolled loop with a regular loop to reduce register
pressure.
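
The shuffle-based matmul can be illustrated with a rough CPU-side
emulation in Python. This is a sketch only, not the WGSL shader from
this change: the data layout (one lane per output row, each lane
holding one row of B in a register) and the names subgroup_shuffle and
shuffle_matmul are assumptions made for illustration.

```python
def subgroup_shuffle(regs, src_lane):
    """Emulate WGSL subgroupShuffle(value, id): every lane receives the
    register value held by lane `src_lane`. `regs` is the per-lane
    register file, modelled here as a plain list (assumption)."""
    return regs[src_lane]


def shuffle_matmul(a, b, sg_size):
    """Compute a @ b with one emulated lane per output row.

    Assumed layout: a is sg_size x sg_size, b is sg_size x n, and
    lane l keeps row l of b in its own register.
    """
    assert len(a) == sg_size and len(b) == sg_size
    n = len(b[0])
    b_regs = [list(row) for row in b]          # lane l holds b[l]
    acc = [[0] * n for _ in range(sg_size)]    # lane l accumulates row l

    # A regular (rolled) loop over k: only one shuffled row of b is
    # live per iteration, which is the idea behind the lower register
    # pressure compared with a fully unrolled loop.
    for k in range(sg_size):
        b_k = subgroup_shuffle(b_regs, k)      # broadcast lane k's row
        for lane in range(sg_size):            # lanes run in lockstep on a GPU
            a_lk = a[lane][k]
            for j in range(n):
                acc[lane][j] += a_lk * b_k[j]
    return acc
```

On a real GPU the lane loop is implicit (all lanes execute together) and
the shuffle replaces a round trip through workgroup shared memory.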
Phi4 prefill for 1K tokens drops from 11.32s to 8.8s (roughly 22% less
time) on the Qualcomm Adreno X1-85 GPU.