Fix attention QK linkage error (#24134)
### Description
This PR moves the CUDA memcpy that writes the QK output when `T` and `QK`
are the same type from `attention_impl.cu` into `attention_qk.cu`.
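
As a rough illustration of the pattern, here is a minimal host-only sketch. It assumes the copy lives in a single templated helper that is explicitly instantiated in one translation unit. The name `CopyQK`, its signature, and the `float`/`double` type pairs are hypothetical and not taken from the PR. `std::memcpy` and the loop stand in for the CUDA memcpy and the conversion kernel.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper standing in for the code in attention_qk.cu.
// Keeping both paths in one template means every (T, QK) pair,
// including T == QK, is defined in the same translation unit.
template <typename T, typename QK>
void CopyQK(const T* input, QK* output, std::size_t count) {
  if (std::is_same<T, QK>::value) {
    // Same type: a raw byte copy is enough. The real code would issue
    // a CUDA memcpy on the compute stream here (assumption).
    std::memcpy(output, input, count * sizeof(T));
  } else {
    // Different types: convert element by element. The real code would
    // launch a conversion kernel here (assumption).
    for (std::size_t i = 0; i < count; ++i) {
      output[i] = static_cast<QK>(input[i]);
    }
  }
}

// Explicit instantiations, including the same-type pair, so that callers
// in other translation units can link against these symbols.
template void CopyQK<float, float>(const float*, float*, std::size_t);
template void CopyQK<float, double>(const float*, double*, std::size_t);
```

With this layout, a caller needs only a declaration of `CopyQK` and does not have to special-case the same-type path itself.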
### Motivation and Context
This PR fixes a linkage error in `attention_qk.cu` that occurs when `T` and
`QK` are the same type.