Update MoE and qMoE spec (#25619)
### Weight Shape Update
Make sure the weight shape in the spec reflects the actual memory layout. The weights are stored in column-major order.
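For illustration, a minimal NumPy sketch of what the column-major layout means for the stored shape (the dimension names and sizes are placeholders, not taken from the spec):

```python
import numpy as np

# Hypothetical sizes for illustration only.
hidden_size, inter_size = 4, 8

# A logical [hidden_size, inter_size] weight stored column-major occupies the
# same memory as its transpose stored row-major.
w_logical = np.arange(hidden_size * inter_size, dtype=np.float32).reshape(hidden_size, inter_size)
w_col_major = np.asfortranarray(w_logical)             # column-major buffer
w_row_major_view = np.ascontiguousarray(w_logical.T)   # row-major [inter_size, hidden_size]

# Identical bytes, so the stored tensor shape should be written to match the
# actual memory layout rather than the logical GEMM orientation.
assert w_col_major.tobytes(order="A") == w_row_major_view.tobytes()
```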
### Add support for SwiGLU activation attributes
Add the spec for the new SwiGLU (Swish-Gated Linear Unit) activation type by
introducing a few new attributes. For reference, see the [Triton kernel
implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).
#### New Attributes for SwiGLU
* **`swiglu_fusion`**:
  * `0`: Not fused; two separate GEMMs (FC1 and FC3).
  * `1`: Fused into a single GEMM with **interleaved** output (`g` and `l` are interleaved per row).
  * `2`: Fused into a single GEMM with **non-interleaved** (concatenated) output.
* **`swiglu_limit`**: Clamp threshold; `g` is clamped from above and `l` is clamped to `[-limit, limit]`.
* **`activation_alpha`**: Scalar multiplier applied to the clamped gate `G` inside the sigmoid.
* **`activation_beta`**: Offset added to the clamped linear term `L` before the final multiplication.
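As a sketch of how these attributes could be attached to a node, assuming the existing `com.microsoft` MoE contrib op; the input/output names, the `activation_type` value, and the numeric values below are placeholders for illustration, not values mandated by the spec:

```python
from onnx import helper

# Illustrative node construction; names and values are assumptions.
moe_node = helper.make_node(
    "MoE",
    inputs=["input", "router_probs", "fc1_weights", "fc1_bias", "fc2_weights", "fc2_bias"],
    outputs=["output"],
    domain="com.microsoft",
    activation_type="swiglu",   # assumed identifier for the SwiGLU activation
    swiglu_fusion=1,            # single fused GEMM with interleaved output
    swiglu_limit=7.0,           # example clamp threshold
    activation_alpha=1.702,     # example alpha
    activation_beta=1.0,        # example beta
)
```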
---
### SwiGLU Activation Function
The SwiGLU function is defined as:
```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```
* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants
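A minimal NumPy sketch of this activation, usable as a reference implementation; the function name and the example constants are illustrative, not part of the spec:

```python
import numpy as np

def swiglu_reference(g, l, alpha, beta, limit):
    """Reference SwiGLU as written above, where g = xW + b and l = xV + c."""
    G = np.minimum(g, limit)          # gate is clamped from above only
    L = np.clip(l, -limit, limit)     # linear part is clamped on both sides
    return G * (1.0 / (1.0 + np.exp(-alpha * G))) * (L + beta)

# Example usage with random data (shapes and constants are illustrative).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
W, V = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
b, c = np.zeros(8), np.zeros(8)
out = swiglu_reference(x @ W + b, x @ V + c, alpha=1.702, beta=1.0, limit=7.0)
```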
---
### Fusion Behavior
* When `swiglu_fusion = 0`:
  * Two GEMMs are computed independently.
  * FC1 computes `g` and FC3 computes `l`.
* When `swiglu_fusion = 1`:
  * `g` and `l` are computed in a **single fused GEMM** (FC1).
  * Output is **interleaved** per row as: `gate, linear, gate, linear, ...`.
* When `swiglu_fusion = 2`:
  * `g` and `l` are computed in a single GEMM (FC1).
  * Output is **concatenated** per row: `[g | l]`.
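The modes differ only in how `g` and `l` are laid out in the GEMM output. A minimal sketch of recovering them in each mode (function and variable names are illustrative, not from the spec):

```python
def split_gate_linear(fc1_out, fc3_out=None, swiglu_fusion=0):
    """Recover (g, l) from the GEMM output(s) under each swiglu_fusion mode."""
    if swiglu_fusion == 0:
        return fc1_out, fc3_out                          # separate FC1 / FC3 outputs
    if swiglu_fusion == 1:
        return fc1_out[..., 0::2], fc1_out[..., 1::2]    # gate, linear interleaved per row
    if swiglu_fusion == 2:
        half = fc1_out.shape[-1] // 2
        return fc1_out[..., :half], fc1_out[..., half:]  # each row is [g | l]
    raise ValueError(f"unsupported swiglu_fusion: {swiglu_fusion}")
```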
### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default swiglu limit.
Update test_moe_cuda.py so its reference implementation applies the same logic.
### Remaining Work
The main purpose of this PR is to update the spec rather than implement it.
Note that the MoE/qMoE ops and tests still use hard-coded parameters; they
will be changed later to read these values from the new attributes.
Column-wise symmetric quantization is used for qMoE. We will add more
quantization details soon, when block-wise quantization support is added.
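For context, a minimal NumPy sketch of generic column-wise symmetric int8 quantization; this illustrates the scheme, not the actual qMoE kernel code, which may use a different bit width:

```python
import numpy as np

def quantize_columnwise_symmetric_int8(w):
    """Column-wise symmetric quantization: one scale per column, zero point 0."""
    scales = np.max(np.abs(w), axis=0) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)        # guard against all-zero columns
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_columnwise(q, scales):
    return q.astype(np.float32) * scales
```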