onnxruntime
69feb84c - Add PCI bus fallback for Linux GPU device discovery in containerized environments (#27591)

Commit
44 days ago
### Description

GPU device discovery on Linux relies exclusively on `/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes containers, `nvidia-drm` is typically not loaded; only the base NVIDIA driver is needed for CUDA compute. No DRM entries means no `OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so `GetEpDevices` never matches the CUDA EP.

Adds a fallback path in `GetGpuDevices()` that scans `/sys/bus/pci/devices/` when DRM yields zero GPUs:

- **`DetectGpuPciPaths()`**: enumerates PCI devices, filters by class code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA datacenter GPUs) per the [PCI Code and ID Assignment Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement) (base class 03h). Accepts an injectable sysfs root path for testability.
- **`GetGpuDeviceFromPci()`**: reads `vendor`/`device` files directly from the PCI device sysfs path and populates `OrtHardwareDevice` with `pci_bus_id` and discrete GPU metadata. Note: `card_idx` is intentionally omitted from PCI-discovered devices since `directory_iterator` traversal order is unspecified and cannot be made consistent with DRM's `cardN` ordering.
- **`GetGpuDevices()`**: tries DRM first; if empty, falls back to the PCI scan.

The PCI detection functions are exposed via a new `onnxruntime::pci_device_discovery` namespace (declared in `core/platform/linux/pci_device_discovery.h`) so they can be tested hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so there is no behavioral change on systems where DRM works.

Also adds:

- A cross-platform `GpuDevicesHaveValidProperties` test that validates GPU device type and vendor ID when GPUs are present. The test intentionally does not assert on `device_id` since some platforms (e.g., Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests (`test/platform/linux/pci_device_discovery_test.cc`) that create fake sysfs directory structures to exercise the PCI fallback path, covering VGA/3D controller detection, non-GPU filtering, empty/missing paths, multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata. Tests use the `ASSERT_STATUS_OK()` macro from `test/util/include/asserts.h` and use `CreateFakePciDevice` to set up complete fake PCI device directories for both `DetectGpuPciPaths` and `GetGpuDeviceFromPci` tests.

### Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA driver, but does not load `nvidia-drm`. The existing `/sys/class/drm`-only detection path returns no GPU devices, blocking `GetEpDevices` from returning the CUDA EP. The same setup works on bare-metal Linux where DRM is loaded.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
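
The PCI fallback described in this commit message can be illustrated with a minimal, standalone C++17 sketch. It is not the code from the commit: every name ending in `Sketch`, the `GpuInfoSketch` struct, the helper signatures, and the choice to pass the PCI devices directory directly are assumptions made for illustration, and the real implementation populates `OrtHardwareDevice` rather than a plain struct. The sketch relies on the sysfs `class` file holding a 24-bit value (base class, subclass, programming interface), so the 16-bit class code is obtained by dropping the low byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Reads a sysfs attribute such as "0x030200" or "0x10de" as an unsigned integer.
// Returns std::nullopt if the file is missing or does not parse as hex.
std::optional<uint32_t> ReadHexFile(const fs::path& file) {
  std::ifstream in(file);
  std::string text;
  if (!(in >> text)) return std::nullopt;
  try {
    return static_cast<uint32_t>(std::stoul(text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

// Hypothetical stand-in for DetectGpuPciPaths(): returns device directories
// whose class code is 0x0300 (VGA) or 0x0302 (3D controller).
// The directory is a parameter so tests can point it at a fake tree.
std::vector<fs::path> DetectGpuPciPathsSketch(const fs::path& pci_devices_dir) {
  std::vector<fs::path> gpus;
  std::error_code ec;
  if (!fs::is_directory(pci_devices_dir, ec)) return gpus;
  for (const auto& entry : fs::directory_iterator(pci_devices_dir, ec)) {
    const auto cls = ReadHexFile(entry.path() / "class");
    if (!cls) continue;
    const uint32_t class_code = (*cls >> 8) & 0xFFFF;  // drop programming interface
    if (class_code == 0x0300 || class_code == 0x0302) gpus.push_back(entry.path());
  }
  // Assumption of this sketch: sort for deterministic output, because
  // directory_iterator traversal order is unspecified.
  std::sort(gpus.begin(), gpus.end());
  return gpus;
}

// Hypothetical plain struct standing in for the GPU metadata.
struct GpuInfoSketch {
  std::string pci_bus_id;  // PCI device directory name, e.g. "0000:00:1e.0"
  uint32_t vendor_id = 0;
  uint32_t device_id = 0;
};

// Hypothetical stand-in for GetGpuDeviceFromPci(): reads vendor/device files.
std::optional<GpuInfoSketch> ReadGpuFromPciSketch(const fs::path& device_dir) {
  const auto vendor = ReadHexFile(device_dir / "vendor");
  const auto device = ReadHexFile(device_dir / "device");
  if (!vendor || !device) return std::nullopt;
  return GpuInfoSketch{device_dir.filename().string(), *vendor, *device};
}

// DRM-first policy: the PCI scan runs only when DRM discovery found nothing.
std::vector<GpuInfoSketch> GetGpuDevicesSketch(const std::vector<GpuInfoSketch>& drm_gpus,
                                               const fs::path& pci_devices_dir) {
  if (!drm_gpus.empty()) return drm_gpus;
  std::vector<GpuInfoSketch> result;
  for (const auto& dir : DetectGpuPciPathsSketch(pci_devices_dir)) {
    if (auto gpu = ReadGpuFromPciSketch(dir)) result.push_back(*gpu);
  }
  return result;
}

// Hypothetical test helper: writes a fake PCI device directory with the
// three attribute files the functions above read.
void CreateFakePciDeviceSketch(const fs::path& pci_devices_dir, const std::string& bus_id,
                               const std::string& cls, const std::string& vendor,
                               const std::string& device) {
  const fs::path dir = pci_devices_dir / bus_id;
  fs::create_directories(dir);
  std::ofstream(dir / "class") << cls << "\n";
  std::ofstream(dir / "vendor") << vendor << "\n";
  std::ofstream(dir / "device") << device << "\n";
}
```

Passing the directory as a parameter is what makes the hermetic style possible: a test can build a temporary tree with `CreateFakePciDeviceSketch`, run the scan against it, and check that non-GPU classes (for example `0x020000`, a network controller) are filtered out, all without real hardware.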