[webgpu] Don't use num_workgroups when using indirect dispatch (#26334)
This pull request updates the FlashAttention WebGPU implementation to
improve support for indirect dispatch. The main changes ensure that, when
indirect dispatch is used, the shader reads the actual workgroup
dimensions from an input buffer rather than relying on the built-in
num_workgroups variable. This avoids the overhead of Dawn duplicating the
indirect dispatch parameters to supply that built-in. See
https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/ComputePassEncoder.cpp;l=275.
This PR fixes the issue where indirect dispatch is slower than normal
dispatch for the same program.
With this change, phi4 with graph capture enabled runs at 145 tps, up
from 125 tps, on an NV 5080.
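
The idea can be illustrated with a minimal sketch that is not the code in this PR. It assumes a hypothetical helper, buildShaderSource, that emits WGSL for a trivial compute shader. The binding name dispatch_dims, the output buffer, and the shader body are illustrative assumptions. In the direct-dispatch variant the shader declares the num_workgroups built-in. In the indirect variant it reads the grid size from a read-only storage buffer assumed to hold the same three u32 values (x, y, z) that dispatchWorkgroupsIndirect consumes.

```typescript
// Hypothetical helper for illustration only; not the ONNX Runtime implementation.
// Emits WGSL for a compute shader that needs the dispatch grid width.
function buildShaderSource(useIndirectDispatch: boolean): string {
  const lines: string[] = [];
  lines.push("@group(0) @binding(0) var<storage, read_write> output: array<u32>;");

  if (useIndirectDispatch) {
    // Assumption: this buffer holds the grid size as three u32 values (x, y, z),
    // matching the layout of an indirect dispatch argument buffer.
    lines.push("@group(0) @binding(1) var<storage, read> dispatch_dims: array<u32, 3>;");
  }

  // Only the direct-dispatch variant declares the built-in grid size.
  const params: string[] = ["@builtin(workgroup_id) wg_id: vec3<u32>"];
  if (!useIndirectDispatch) {
    params.push("@builtin(num_workgroups) num_wg: vec3<u32>");
  }

  lines.push("@compute @workgroup_size(1)");
  lines.push(`fn main(${params.join(", ")}) {`);
  lines.push(
    useIndirectDispatch
      ? "  let dims_x = dispatch_dims[0];" // grid width read from the input buffer
      : "  let dims_x = num_wg.x;"         // grid width read from the built-in
  );
  lines.push("  let linear = wg_id.y * dims_x + wg_id.x;");
  lines.push("  output[linear] = linear;");
  lines.push("}");
  return lines.join("\n");
}

const directSource = buildShaderSource(false);
const indirectSource = buildShaderSource(true);
console.log(indirectSource);
```

Because the indirect variant never mentions the built-in, the implementation has no reason to materialize num_workgroups for it, which is the overhead this change aims to avoid.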