Add EXAONE 4.0 model support for Inference V2 (#7853)
## Summary
Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed
Inference V2.
Closes #7453
## Changes
- New model implementation:
`deepspeed/inference/v2/model_implementations/exaone4/`
- `container.py`: Transformer and non-transformer parameter containers
- `model.py`: Inference model with post-norm architecture and QK-Norm
support
- `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`
## Key architectural differences from Mistral/Llama
- **Post-norm**: RMSNorm is applied to the attention/MLP outputs (rather
than their inputs), and the residual is added afterwards
- **QK-Norm**: per-head RMSNorm applied to the Q and K projections after
the QKV linear layer
- **Hybrid attention**: the 32B model uses a 3:1 sliding-window/full
attention ratio, selected per layer via the `layer_types` config
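The post-norm and QK-Norm behavior can be sketched as follows. This is an illustrative NumPy sketch, not the DeepSpeed implementation; the helper names (`rms_norm`, `post_norm_block`, `qk_norm`) and the weight layout are assumptions for exposition only.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over the last dimension: x / rms(x) * weight
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def post_norm_block(hidden, sublayer_out, norm_weight):
    # Post-norm (illustrative): normalize the sublayer OUTPUT, then add the
    # residual -- unlike pre-norm, which normalizes the sublayer input.
    return hidden + rms_norm(sublayer_out, norm_weight)

def qk_norm(q, k, q_weight, k_weight):
    # QK-Norm (illustrative): q, k shaped [seq, n_heads, head_dim]; RMSNorm
    # acts over head_dim, so each head is normalized independently.
    return rms_norm(q, q_weight), rms_norm(k, k_weight)
```

In the actual model code these steps happen inside the transformer layer after the QKV projection and after each attention/MLP sublayer, respectively.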
## Supported models
- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B)
(all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
(hybrid sliding/full attention)
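As a rough illustration of the 3:1 hybrid pattern mentioned above: in the real model the per-layer attention kinds come from the checkpoint's `layer_types` config entry, but a pattern of that shape could be generated like this (the `make_layer_types` helper and the exact type strings are assumptions for illustration, mirroring the Hugging Face convention).

```python
def make_layer_types(num_layers, period=4):
    # Hypothetical sketch: every `period`-th layer uses full attention,
    # the remaining layers use sliding-window attention (3:1 at period=4).
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]
```

For an 8-layer toy model this yields two full-attention layers (indices 3 and 7) and six sliding-window layers.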
Requires `transformers >= 4.54.0`.
## Related
- Supersedes #7456 (draft, inactive for 6 months)
---------
Signed-off-by: Bias92 <pewpewplay315@gmail.com>