[CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow) (#21625)
**Remove xfails from 4 CUDA conformance tests that require NVML:**
- SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
- SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
- SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
- SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)
These tests failed with 'Driver/library version mismatch' because
libnvidia-ml.so in the container (550.144) was incompatible with the
NVIDIA driver on the CI host. With the host driver updated, the xfails
are no longer needed.
**Add Early NVML Version Check in CI**
A new workflow step validates driver/NVML compatibility before any tests run. It:
- Detects the host driver version via nvidia-smi
- Detects the container NVML library version from libnvidia-ml.so.1
- Tests compatibility by running nvidia-smi inside the container
- Fails fast with a clear error message if the versions are incompatible
- Uses GitHub Actions error annotations for high visibility
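The detection and reporting steps above could look roughly like the following Python sketch. It is illustrative only and is not the workflow step added by this PR: the helper names, the idea of reading the version from the resolved library file name, and the example path are all assumptions. The `nvidia-smi --query-gpu=driver_version` query and the `::error` workflow command are existing interfaces.

```python
import os
import re
import subprocess
from typing import Optional


def host_driver_version() -> str:
    """Ask nvidia-smi for the driver version (first GPU only)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


def library_version_from_path(path: str) -> Optional[str]:
    """Extract 'X.Y[.Z]' from a name like libnvidia-ml.so.X.Y.Z.

    Assumption: the soname libnvidia-ml.so.1 is a symlink to a file whose
    name carries the full version, so callers pass os.path.realpath(...).
    """
    match = re.search(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$", os.path.basename(path))
    return match.group(1) if match else None


def error_annotation(title: str, message: str) -> str:
    """Format a GitHub Actions error annotation line."""
    return f"::error title={title}::{message}"


def report_mismatch(soname_path: str) -> None:
    """Print an annotation describing both versions (hypothetical helper)."""
    lib = library_version_from_path(os.path.realpath(soname_path))
    drv = host_driver_version()
    print(error_annotation("NVML version mismatch",
                           f"driver {drv}, container libNVML {lib}"))
```

A CI step would call something like `report_mismatch("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1")` and exit non-zero; that library path is a common location on Debian-based images, not one taken from this PR.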
**NVML Version Compatibility Rules**
Per NVIDIA NVML API documentation:
- Major version must match: Driver 550.x requires libNVML 550.x
- Library version ≤ Driver version: Library cannot be newer than driver
- Different major versions always fail: Driver 550.x + libNVML 565.x = mismatch
- Examples:
  - ✅ Driver 550.90.07 + libNVML 550.90.07 (exact match)
  - ✅ Driver 550.90.07 + libNVML 550.54.15 (older library minor version)
  - ❌ Driver 550.90.07 + libNVML 565.57.01 (different major version)
  - ❌ Driver 550.54.15 + libNVML 550.90.07 (newer library minor version)
The check validates compatibility by running nvidia-smi, which implements
NVIDIA's official version checking logic.
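
For reference, the rules listed above can be written as a small predicate. This is only a sketch of the rules as stated in this description, with hypothetical function names; the CI step itself relies on nvidia-smi rather than on a comparison like this.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '550.90.07' into (550, 90, 7) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def is_compatible(driver: str, library: str) -> bool:
    """Apply the two rules: same major version, library not newer than driver."""
    drv = parse_version(driver)
    lib = parse_version(library)
    if drv[0] != lib[0]:
        return False  # different major versions always fail
    return lib <= drv  # tuple comparison: library must not be newer


# The four examples from the list above.
for drv, lib in [
    ("550.90.07", "550.90.07"),
    ("550.90.07", "550.54.15"),
    ("550.90.07", "565.57.01"),
    ("550.54.15", "550.90.07"),
]:
    print(drv, lib, is_compatible(drv, lib))
```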
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>