Add PCI bus fallback for Linux GPU device discovery in containerized environments (#27591)
### Description
GPU device discovery on Linux relies exclusively on
`/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes
containers, `nvidia-drm` is typically not loaded—only the base NVIDIA
driver is needed for CUDA compute. No DRM entries means no
`OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so
`GetEpDevices` never matches the CUDA EP.
Adds a fallback path in `GetGpuDevices()` that scans
`/sys/bus/pci/devices/` when DRM yields zero GPUs:
- **`DetectGpuPciPaths()`** — enumerates PCI devices, filters by class
code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA
datacenter GPUs) per the [PCI Code and ID Assignment
Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement)
(base class 03h). Accepts an injectable sysfs root path for testability.
- **`GetGpuDeviceFromPci()`** — reads `vendor`/`device` files directly
from the PCI device sysfs path and populates `OrtHardwareDevice` with
`pci_bus_id` and discrete GPU metadata. Note: `card_idx` is
intentionally omitted from PCI-discovered devices since
`directory_iterator` traversal order is unspecified and cannot be made
consistent with DRM's `cardN` ordering.
- **`GetGpuDevices()`** — tries DRM first; if DRM yields no devices, falls
back to the PCI scan.
The PCI detection functions are exposed via a new
`onnxruntime::pci_device_discovery` namespace (declared in
`core/platform/linux/pci_device_discovery.h`) so they can be tested
hermetically with fake sysfs directories.
The fallback only activates when DRM finds nothing, so there is no
behavioral change on systems where DRM discovery works.
Also adds:
- A cross-platform `GpuDevicesHaveValidProperties` test that validates
GPU device type and vendor ID when GPUs are present. The test
intentionally does not assert on `device_id` since some platforms (e.g.,
Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests
(`test/platform/linux/pci_device_discovery_test.cc`) that create fake
sysfs directory structures to exercise the PCI fallback path, covering
VGA/3D controller detection, non-GPU filtering, empty/missing paths,
multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata.
Tests use the `ASSERT_STATUS_OK()` macro from
`test/util/include/asserts.h` and use `CreateFakePciDevice` to set up
complete fake PCI device directories for both `DetectGpuPciPaths` and
`GetGpuDeviceFromPci` tests.
### Motivation and Context
CUDA EP registration fails on AKS (Azure Kubernetes Service) because the
NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA
driver, but does not load `nvidia-drm`. The existing
`/sys/class/drm`-only detection path returns no GPU devices, blocking
`GetEpDevices` from returning the CUDA EP. The same setup works on
bare-metal Linux where DRM is loaded.
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>