[webgpu] Optimize FlashAttention for prefill (#25395)
### Description
This PR enhances unidirectional `FlashAttention` by applying causal
masking inside the main loop. Because a query position only attends to
key positions at or before it, the loop over KV blocks can stop early,
eliminating unnecessary memory loads for future entries in the KV cache.
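The idea can be sketched in NumPy (illustrative only, not the actual WGSL kernel; function and variable names here are hypothetical): the inner loop over KV blocks ends at the query block rather than iterating over the full sequence, and only the diagonal block needs an explicit per-element mask.

```python
import numpy as np

def causal_flash_attention(Q, K, V, block=4):
    """Block-wise attention with causal masking applied inside the main loop."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for q0 in range(0, n, block):
        q1 = min(q0 + block, n)
        Qb = Q[q0:q1]
        # Online-softmax accumulators for this query block.
        m = np.full(q1 - q0, -np.inf)   # running row max
        l = np.zeros(q1 - q0)           # running row sum
        acc = np.zeros((q1 - q0, d))    # running weighted sum of V
        # Key point of the optimization: iterate only over KV blocks up to
        # the end of the query block; later blocks are fully masked, so
        # their loads are skipped entirely.
        for k0 in range(0, q1, block):
            k1 = min(k0 + block, q1)
            S = (Qb @ K[k0:k1].T) * scale
            # Within the diagonal block, mask future key positions.
            qi = np.arange(q0, q1)[:, None]
            ki = np.arange(k0, k1)[None, :]
            S = np.where(ki <= qi, S, -np.inf)
            m_new = np.maximum(m, S.max(axis=1))
            p = np.exp(S - m_new[:, None])
            corr = np.exp(m - m_new)
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ V[k0:k1]
            m = m_new
        out[q0:q1] = acc / l[:, None]
    return out
```

A bidirectional kernel would instead run the inner loop over all KV blocks; restricting it to `range(0, q1, block)` is what removes the wasted loads for prefill with causal attention.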
Testing on Lunar Lake shows up to a 20% performance improvement for
`phi-4-mini-accuracy4` (with a prompt length of 4096 tokens). Similar
gains were observed for other models, including `Qwen3-0.6B-accuracy4`.
This PR also switches from `is_gqa` to the more readable
`unidirectional` attribute to control causal masking.
### Motivation and Context
See above.