[CUDA] Support SwiGLU in MoE and qMoE (#25530)
### Description
This implements the SwiGLU activation for MoE and qMoE. The activation
corresponds to the implementation in
https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py
This change also updates test_parity_moe.py to enable the qMoE tests in CI pipelines.
### Motivation and Context
This is a naive implementation of the activation. Because the activation
halves the length of each row, it cannot be applied directly in the GEMM
epilogue. The current implementation needs an extra buffer to run the
SwiGLU kernel. In the future, we might look at alternatives that do not
need an extra buffer.
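
To illustrate why the output cannot simply overwrite the input in place, here is a minimal Python sketch of a row-wise SwiGLU that reads a row of length 2N and writes a separate row of length N. It is not the CUDA kernel from this change. The function name `swiglu_rows`, the interleaved gate/linear layout, the `alpha` and `limit` defaults, and the `linear + 1` term are all assumptions based on a reading of the linked Triton reference, not details stated in this description.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu_rows(rows, alpha=1.702, limit=7.0):
    """Hypothetical reference: each input row of length 2*N yields N outputs.

    Assumed layout: even indices hold the gate values, odd indices hold the
    linear values. alpha and limit are assumed defaults, not taken from this PR.
    """
    out = []  # separate output buffer, half the width of the input
    for row in rows:
        if len(row) % 2 != 0:
            raise ValueError("row length must be even")
        gate = row[0::2]
        linear = row[1::2]
        out_row = []
        for g, l in zip(gate, linear):
            g = min(g, limit)               # clamp gate from above
            l = max(-limit, min(l, limit))  # clamp linear to [-limit, limit]
            out_row.append(g * _sigmoid(alpha * g) * (l + 1.0))
        out.append(out_row)
    return out


if __name__ == "__main__":
    x = [[0.0, 1.0, 2.0, 3.0]]
    y = swiglu_rows(x)
    print(len(x[0]), "->", len(y[0]))
```

Each output element depends on a pair of input elements and the output row is half as wide as the input row, which is why this sketch writes to a second buffer, mirroring the extra buffer described above.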