[CUDA] Support sliding window in CUTLASS fused attention (#24072)
### Description
Add sliding window support to CUTLASS fused attention.
### Motivation and Context
The change was previously created by Ye:
https://github.com/microsoft/onnxruntime/pull/21926
I merged that change and resolved some conflicts. I also reverted some of
Ye's changes in kernel_forward.h so that our code stays consistent with
the PyTorch code.