reduce GQA test combinations (#22918)
### Description
* Reduce GQA test combinations to save about 35 minutes of test time in CI
pipelines.
* Show latency of transformers tests
* Use a fixed seed in the DMMHA test to avoid random failures.
* For test_flash_attn_rocm.py, change the test skipping condition from "has cuda
ep" to "not has rocm ep", so that it does not run in CPU builds.
* For test_flash_attn_cuda.py, move flash attention and memory efficient
attention tests to different classes, so that we can skip a test suite
instead of checking in each test.
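
A minimal sketch of the suite-level skip pattern described in the last two bullets, assuming `unittest`-style test classes. The class names, the `has_provider` helper, and the hard-coded `AVAILABLE_PROVIDERS` list are hypothetical stand-ins; in the real tests the list would presumably come from `onnxruntime.get_available_providers()`.

```python
import unittest

# Stand-in for onnxruntime.get_available_providers(), hard-coded here
# (assumption) so that the sketch runs without onnxruntime installed.
AVAILABLE_PROVIDERS = ["CPUExecutionProvider"]


def has_provider(name):
    """Hypothetical helper: True if the execution provider is available."""
    return name in AVAILABLE_PROVIDERS


# Skip the whole suite once, instead of checking inside every test method.
@unittest.skipIf(not has_provider("CUDAExecutionProvider"), "CUDA EP is not available")
class TestFlashAttention(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)


# "not has rocm ep": a CPU-only build skips this suite as well.
@unittest.skipIf(not has_provider("ROCMExecutionProvider"), "ROCm EP is not available")
class TestRocmAttention(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)


def run_suites():
    """Run both suites and return the unittest.TestResult."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(TestFlashAttention))
    suite.addTests(loader.loadTestsFromTestCase(TestRocmAttention))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With only the CPU provider listed, both suites are reported as skipped rather than failing, which is the behavior wanted for a CPU build.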
### Motivation and Context
It takes too long to run GQA tests in CI pipelines since there are too
many combinations.
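
To illustrate why the number of combinations grows so quickly, here is a rough sketch comparing a full Cartesian product of test parameters with one possible reduced set. The parameter names and values are hypothetical, and the cycling strategy is only an example, not necessarily the reduction used in this change.

```python
from itertools import product

# Hypothetical parameter lists, not the values used in the actual tests.
PARAMS = {
    "batch_size": [1, 3, 5],
    "seq_len": [127, 240, 2000, 4000],
    "local_window": [False, True],
    "packed_qkv": [False, True],
}


def full_combinations(params):
    """Every combination: the count is the product of the list lengths."""
    return list(product(*params.values()))


def reduced_combinations(params):
    """Cycle through each list in lockstep so that every value of every
    parameter appears at least once; the count is the longest list length."""
    lists = list(params.values())
    longest = max(len(values) for values in lists)
    return [tuple(values[i % len(values)] for values in lists) for i in range(longest)]


full = full_combinations(PARAMS)
reduced = reduced_combinations(PARAMS)
print(len(full), len(reduced))
```

The trade-off is that the reduced set no longer exercises every interaction between parameters, only every individual value.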
###### Linux GPU CI Pipeline
Before: 5097 passed, 68 skipped, 8 warnings in 1954.64s (0:32:34)
After: 150 passed, 176 skipped, 8 warnings in 530.38s (0:08:50)
Time Saved: **1424** seconds (0:23:44)
###### Windows GPU CUDA CI Pipeline
Before: 1781 passed, 72 skipped, 6 warnings in 605.48s (0:10:05)
After: 116 passed, 118 skipped, 6 warnings in 275.48s (0:04:35)
Time Saved: **330** seconds (0:05:30)
###### Linux CPU CI Pipeline
Before: 5093 passed, 72 skipped, 4 warnings in 467.04s (0:07:47)
- 212.96s transformers/test_gqa_cpu.py::TestGQA::test_gqa_past
- 154.12s transformers/test_gqa_cpu.py::TestGQA::test_gqa_no_past
- 26.45s transformers/test_gqa_cpu.py::TestGQA::test_gqa_interactive_one_batch
After: 116 passed, 210 skipped, 4 warnings in 93.41s (0:01:33)
- 0.97s transformers/test_gqa_cpu.py::TestGQA::test_gqa_past
- 19.23s transformers/test_gqa_cpu.py::TestGQA::test_gqa_no_past
- 2.41s transformers/test_gqa_cpu.py::TestGQA::test_gqa_interactive_one_batch
Time Saved: **374** seconds (0:06:14).
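
As a quick check of the "about 35 minutes" figure, the per-pipeline savings can be summed from the before/after timings reported above. This is only an arithmetic sketch; the dictionary keys are shorthand labels introduced here.

```python
# (before, after) wall-clock times in seconds, copied from the results above.
RUNS = {
    "Linux GPU": (1954.64, 530.38),
    "Windows GPU CUDA": (605.48, 275.48),
    "Linux CPU": (467.04, 93.41),
}

# Seconds saved per pipeline and in total.
saved = {name: before - after for name, (before, after) in RUNS.items()}
total_saved = sum(saved.values())

for name, seconds in saved.items():
    print(f"{name}: {seconds:.0f} s saved")
print(f"Total: {total_saved:.0f} s ({total_saved / 60:.1f} min)")
```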