[webgpu] Use workgroup memory to reduce register pressure (#24286)
On Qualcomm Adreno X1 GPUs, the previous implementation of the
FlashAttentionProgram shader in the WebGPU backend caused high
register pressure, which degraded performance. This PR uses
workgroup memory to reduce register pressure and improve
performance.
Time to first token (TTFT) for phi4 with 1K inputs drops from 40s to
10s on a Qualcomm Adreno X1 GPU.
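
For readers unfamiliar with the technique, the WGSL below is a minimal
sketch of the general idea and is not the actual FlashAttentionProgram
shader. It assumes a simplified Q*K^T score computation. All names and
sizes (TILE, HEAD_DIM, NUM_KEYS, k_tile, q, k, scores) are hypothetical.
Each K tile is staged once in a var<workgroup> array shared by the whole
workgroup, rather than each invocation keeping its own copy in a
function-scope array that the compiler has to hold in registers.

```wgsl
// Illustrative sketch only; names and sizes are assumptions, not from the PR.
const TILE : u32 = 32u;       // keys per tile == invocations per workgroup
const HEAD_DIM : u32 = 64u;   // elements per Q/K row
const NUM_KEYS : u32 = 256u;  // assumed to be a multiple of TILE

@group(0) @binding(0) var<storage, read> q : array<f32>;
@group(0) @binding(1) var<storage, read> k : array<f32>;
@group(0) @binding(2) var<storage, read_write> scores : array<f32>;

// One K tile shared by all invocations (32 * 64 * 4 bytes = 8 KiB).
var<workgroup> k_tile : array<f32, TILE * HEAD_DIM>;

@compute @workgroup_size(TILE)
fn main(@builtin(local_invocation_index) lid : u32,
        @builtin(workgroup_id) wid : vec3<u32>) {
  // Each invocation owns one query row.
  let q_row = wid.x * TILE + lid;

  for (var base = 0u; base < NUM_KEYS; base += TILE) {
    // Cooperative load: each invocation copies one key row into the tile.
    for (var d = 0u; d < HEAD_DIM; d++) {
      k_tile[lid * HEAD_DIM + d] = k[(base + lid) * HEAD_DIM + d];
    }
    workgroupBarrier();  // tile fully written before anyone reads it

    // Per-invocation state is a single accumulator, not a private K array.
    for (var j = 0u; j < TILE; j++) {
      var acc = 0.0;
      for (var d = 0u; d < HEAD_DIM; d++) {
        acc += q[q_row * HEAD_DIM + d] * k_tile[j * HEAD_DIM + d];
      }
      scores[q_row * NUM_KEYS + base + j] = acc;
    }
    workgroupBarrier();  // all reads done before the next tile overwrites it
  }
}
```

The trade-off in a layout like this is extra barriers and workgroup
storage in exchange for fewer live per-invocation values. Whether it
helps depends on the GPU's register file and compiler.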