Add GQA fusion for CUDA EP (#24335)
### Description
Most models can benefit from fusing the nodes that precede GQA into a
single MatMul or MatMulNBits. This change detects the patterns that can
be fused and applies the fusion on the CUDA EP.
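
As a rough illustration of why such a fusion preserves results, the sketch below assumes the pre-GQA nodes are three separate Q, K and V projection MatMuls that share one input. That pattern is an assumption made here for illustration; the description does not list the exact patterns matched. The helper names (`matmul`, `concat_columns`, `split_columns`) and the toy shapes are hypothetical and are not part of ONNX Runtime.

```python
# Hypothetical illustration only; this is not the ONNX Runtime fusion code.
# Assumption: the fused pattern is three projections (Q, K, V) of the same input.

def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_columns(*mats):
    """Concatenate matrices with equal row counts along the column axis."""
    return [sum(rows, []) for rows in zip(*mats)]

def split_columns(mat, widths):
    """Split a matrix into consecutive column blocks of the given widths."""
    blocks, start = [], 0
    for width in widths:
        blocks.append([row[start:start + width] for row in mat])
        start += width
    return blocks

# Toy shapes: 3 tokens, hidden size 2. Q has 4 output columns while K and V
# have 2 each, mimicking fewer KV heads than query heads.
x = [[1, 2], [3, 4], [5, 6]]
w_q = [[1, 0, 2, 1], [0, 1, 1, 3]]
w_k = [[2, 1], [1, 0]]
w_v = [[0, 3], [1, 1]]

# Unfused: three separate MatMul nodes.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Fused: one MatMul against the column-concatenated weight, then a split.
w_qkv = concat_columns(w_q, w_k, w_v)
packed = matmul(x, w_qkv)
q_f, k_f, v_f = split_columns(packed, [4, 2, 2])

fused_matches = (q_f, k_f, v_f) == (q, k, v)
print(fused_matches)
```

The same column-wise packing idea would carry over to quantized weights only if the quantization blocks line up across the concatenated matrices; whether and how the MatMulNBits case handles that is not stated here.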
### Motivation and Context
This enables publishing a single GPU model going forward.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>