[CUDA] update test_flash_attn_cuda.py for Windows (#21006)
Currently test_flash_attn_cuda.py can only run on Linux because it uses
triton for the rotary reference implementation, and the triton Python
package is not available on Windows.
This change updates the script so that the test can run on Windows, which
lets us test memory efficient attention on Windows.
Because triton is unavailable there, rotary is excluded from testing on Windows.
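
A minimal sketch of how such a platform guard could look. The helper names
rotary_supported and rotary_options are hypothetical and are not taken from
the test script; the sketch also assumes that Linux is the only platform
where the triton-based rotary reference is usable.

```python
import platform


def rotary_supported(system=None):
    """Return True when the triton-based rotary reference can be used.

    Assumption: triton is only treated as available on Linux.
    """
    system = system or platform.system()
    return system == "Linux"


def rotary_options(system=None):
    """Return the rotary settings a test should iterate over."""
    # Without triton there is no rotary reference to compare against,
    # so only the non-rotary case is exercised.
    return [False, True] if rotary_supported(system) else [False]
```

A test could then loop over rotary_options() instead of a hard-coded
[False, True] list, and import triton only when rotary_supported() is true.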