Reduce GQA CPU test combinations (#26897)
The current testing strategy for GQA on CPU attempts to run a Cartesian
product of all configuration parameters (batch size, sequence length,
rotary embeddings, packed KV, softcap, etc.), leading to over 2000 test
combinations. This causes significant runtime overhead and potential
timeouts.
This PR optimizes `test_gqa_cpu.py` by:
- Replacing the nested loop over all parameters with a round-robin
selection strategy (`combo_index`).
- Significantly reducing the number of test cases (from ~2304 to ~32 in
pipeline mode) while maintaining coverage of individual features
(rotary, packed, softcap, etc.).
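The round-robin idea can be sketched roughly as follows. This is a minimal illustration, not the code in `test_gqa_cpu.py`: the parameter names and values in `PARAMS` and the helpers `full_product` and `round_robin_combos` are hypothetical, and only the `combo_index` name is taken from this PR. Each test case picks every parameter's value by indexing its option list with a shared `combo_index` modulo the list length, so every individual value is exercised at least once without enumerating the full product.

```python
import itertools

# Hypothetical parameter grid for illustration only; the real test
# uses its own parameter lists and values.
PARAMS = {
    "batch_size": [1, 3],
    "seq_len": [1, 16, 128],
    "rotary": [False, True],
    "packed": [False, True],
    "softcap": [0.0, 50.0],
}


def full_product(params):
    """Enumerate every combination (the old nested-loop behavior)."""
    return list(itertools.product(*params.values()))


def round_robin_combos(params, num_cases=None):
    """Build one config per combo_index, cycling through each option list.

    With num_cases at least as large as the longest option list, every
    value of every parameter appears in at least one config.
    """
    if num_cases is None:
        num_cases = max(len(options) for options in params.values())
    combos = []
    for combo_index in range(num_cases):
        combos.append(
            {
                name: options[combo_index % len(options)]
                for name, options in params.items()
            }
        )
    return combos


if __name__ == "__main__":
    print("full product:", len(full_product(PARAMS)))
    for config in round_robin_combos(PARAMS):
        print(config)
```

One trade-off of this sketch: parameters with option lists of the same length move in lockstep under a shared index, so it covers each feature value individually but not every pairwise interaction between features.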
Each feature is still exercised, but the suite runs much faster: test
time drops from minutes to seconds, which saves a lot of compute
resources in the CI pipeline.