llvm
01bc8297 - [CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow) (#21625)

Commit
2 days ago
[CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow) (#21625)

**Remove xfails from 4 CUDA conformance tests that require NVML:**
- SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
- SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
- SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
- SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)

These tests were failing with 'Driver/library version mismatch' due to incompatibility between libnvidia-ml.so in the container (550.144) and the NVIDIA driver on the CI host.

**Add Early NVML Version Check in CI**

New workflow step that validates compatibility before running tests:
- Detects host driver version via nvidia-smi
- Detects container NVML library version from libnvidia-ml.so.1
- Tests compatibility by running nvidia-smi from container
- Fails fast with clear error message if versions are incompatible
- Uses GitHub Actions error annotations for high visibility

**NVML Version Compatibility Rules**

Per NVIDIA NVML API documentation:
- Major version must match: Driver 550.x requires libNVML 550.x
- Library version ≤ Driver version: Library cannot be newer than driver
- Different major versions always fail: Driver 550.x + libNVML 565.x = mismatch
- Examples:
  - ✅ Driver 550.90.07 + libNVML 550.90.07 (exact match)
  - ✅ Driver 550.90.07 + libNVML 550.54.15 (older library minor version)
  - ❌ Driver 550.90.07 + libNVML 565.57.01 (different major version)
  - ❌ Driver 550.54.15 + libNVML 550.90.07 (newer library minor version)

The check uses nvidia-smi to validate compatibility, which implements NVIDIA's official version checking logic.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
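
The compatibility rules listed in the commit message can be illustrated with a small sketch. This is not the check the workflow runs (the commit states that it delegates to nvidia-smi); it only encodes the two stated rules, and the function names `parse_version` and `nvml_compatible` are hypothetical helpers introduced here for illustration.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version such as '550.90.07' into integer components."""
    return tuple(int(part) for part in version.strip().split("."))


def nvml_compatible(driver: str, library: str) -> bool:
    """Apply the two rules stated in the commit message (illustrative only).

    1. The major versions must match.
    2. The library version must not be newer than the driver version.
    """
    drv = parse_version(driver)
    lib = parse_version(library)
    same_major = drv[0] == lib[0]
    # Tuples compare component by component, so (550, 54, 15) <= (550, 90, 7).
    library_not_newer = lib <= drv
    return same_major and library_not_newer


if __name__ == "__main__":
    # The four examples from the commit message.
    print(nvml_compatible("550.90.07", "550.90.07"))
    print(nvml_compatible("550.90.07", "550.54.15"))
    print(nvml_compatible("550.90.07", "565.57.01"))
    print(nvml_compatible("550.54.15", "550.90.07"))
```

Comparing parsed integer tuples rather than raw strings avoids lexicographic surprises such as "550.9" sorting after "550.54".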