[webgpu] Restore FP16 math in flash attention generation (#24994)
This PR restores FP16 math in the flash attention generation shader. Following
#24953, the attention scale is multiplied into Q before the QK product instead
of being applied afterwards, so the intermediate values stay within FP16 range
and do not overflow.
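A minimal NumPy sketch of why the scaling order matters (this is an illustration, not the actual WGSL shader; the head size, scale, and input magnitudes are made up):

```python
import numpy as np

# Assumed values for illustration: scale = 1/sqrt(head_size) with head_size = 64.
scale = np.float16(0.125)
q = np.full(4, 120.0, dtype=np.float16)   # large Q activations
k = np.full(4, 150.0, dtype=np.float16)   # large K activations

# Scale applied after QK: each partial product is 120 * 150 = 18000, and
# the running FP16 sum (4 * 18000 = 72000) exceeds the FP16 maximum of
# 65504, overflowing to inf before the scale can shrink it.
acc = np.float16(0.0)
for qi, ki in zip(q, k):
    acc = np.float16(acc + qi * ki)       # overflows to inf on the last step
scale_after_qk = acc * scale              # inf

# Scale folded into Q first (the #24953 approach): partial products are
# 15 * 150 = 2250, and the sum (~9000) stays comfortably in FP16 range.
acc = np.float16(0.0)
for qi, ki in zip(q * scale, k):
    acc = np.float16(acc + qi * ki)
scale_before_qk = acc                     # ~9000.0

print(scale_after_qk, scale_before_qk)    # inf vs. a finite FP16 value
```

The same reasoning applies per-element inside the shader's tiled QK loop: pre-scaling Q bounds every partial sum, which is what makes it safe to keep the accumulation in FP16 rather than widening to FP32.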