Add aarch64 wheel build to CUDA 13 Python packaging pipelines (#27760)
### Description
Adds aarch64 Linux wheel builds to the CUDA GPU packaging pipeline,
mirroring the existing x86_64 configuration.
- **`stages/py-linux-gpu-stage.yml`**: Add `hostArchitecture: Arm64` to
pool config when `arch == 'aarch64'` (matches pattern in `py-linux.yml`)
- **`stages/py-gpu-packaging-stage.yml`**: Add
`docker_base_image_aarch64` and `AArch64LinuxPythonConfigurations`
parameters (the configuration list defaults to `[]`, so the CUDA 12
pipeline is unaffected), add aarch64 build stages, and merge artifact
dependencies/downloads
- **`py-cuda13-packaging-pipeline.yml`**: Pass aarch64 base image and
Python configs for all supported versions (3.11–3.14, including
free-threaded)
- **`aarch64/python/cuda/Dockerfile`** +
**`scripts/install_centos.sh`**: New Docker build context for aarch64
CUDA builds. It differs from the x86_64 variant in that aarch64 installs
TensorRT from a tar archive.
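
For illustration only (this is a sketch, not the actual diff), the first two bullets could take roughly the following shape in Azure Pipelines template syntax. Only `hostArchitecture: Arm64`, `arch`, `docker_base_image_aarch64`, and `AArch64LinuxPythonConfigurations` come from the description above; `machine_pool`, the template path inside the loop, and the parameters passed to it are hypothetical names chosen for the example.

```yaml
# Fragment 1 (sketch): pool config in a stage template.
# `machine_pool` is a hypothetical parameter name.
pool:
  name: ${{ parameters.machine_pool }}
  # This key is inserted only when the template is expanded for aarch64.
  ${{ if eq(parameters.arch, 'aarch64') }}:
    hostArchitecture: Arm64
---
# Fragment 2 (sketch): parameters in the packaging stage template.
parameters:
- name: docker_base_image_aarch64
  type: string
  default: ''            # assumed default, not taken from the PR
- name: AArch64LinuxPythonConfigurations
  type: object
  default: []            # empty list: callers that omit it get no aarch64 stages

stages:
# With the default `[]`, this loop expands to nothing.
- ${{ each config in parameters.AArch64LinuxPythonConfigurations }}:
  - template: py-linux-gpu-stage.yml        # hypothetical relative path
    parameters:
      arch: aarch64
      docker_base_image: ${{ parameters.docker_base_image_aarch64 }}  # hypothetical parameter name
      python_config: ${{ config }}                                    # hypothetical parameter name
```

Because the loop runs over an empty list by default, a pipeline that does not pass `AArch64LinuxPythonConfigurations` generates no extra stages, which is why the CUDA 12 pipeline stays unchanged.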
### Motivation and Context
`onnxruntime-gpu` currently ships only Linux x86_64 and Windows wheels.
Installing on `manylinux_2_39_aarch64` (e.g. `ubuntu-24.04-arm` runners)
fails because no compatible wheel is available.
- Fixes microsoft/onnxruntime#27005
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>