onnxruntime
a1fc9165 - Additional diagnostics for DML failure path (#28495)

Commit
5 days ago
### Description

In DmlGraphFusionHelper::ExecuteReusableCommandList, after ExecuteCommandList fails:

* Broaden the failure branch from just DXGI_ERROR_DEVICE_REMOVED to also catch DEVICE_HUNG, DEVICE_RESET, and DRIVER_INTERNAL_ERROR.
* Query GetDeviceRemovedReason on both the DML and D3D12 devices (matching the pattern in DmlCommandRecorder.cpp).
* Throw via ORT_THROW_HR_MSG with a clear message that names the failure as a TDR / device-removal event and includes all three HRESULTs for triage. The existing DEVICE_REMOVED path still throws the same HRESULT as before.

### Motivation and Context

While investigating a WebNN sample failure on Chrome running Stable Diffusion 1.5 on an AMD Radeon 860M iGPU, ORT 1.23.4 surfaced this error:

`DmlGraphFusionHelper.cpp(1078) ... 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.`

0x887A0006 is DXGI_ERROR_DEVICE_HUNG. The text "...invalid command passed by the calling application" seems to be the FormatMessage string for that HRESULT. The pre-existing code in DmlGraphFusionHelper::ExecuteReusableCommandList only special-cased DXGI_ERROR_DEVICE_REMOVED, so for DEVICE_HUNG / DEVICE_RESET / DRIVER_INTERNAL_ERROR HRESULTs the user just got the raw message. I wanted to add a little more diagnostic information for these cases.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
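The broadened failure branch described above can be illustrated with a small, portable sketch. This is not the code from the commit: `HResult`, `IsDeviceLost`, and `FormatDeviceLostMessage` are hypothetical names, the message wording is made up, and the HRESULT type is replaced by a plain 32-bit integer so the example compiles without Windows headers. Only the four DXGI error values are taken from the Windows SDK definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the Windows HRESULT type (assumption: plain 32-bit value).
using HResult = std::uint32_t;

// DXGI error codes as defined in the Windows SDK.
constexpr HResult kDeviceRemoved       = 0x887A0005u;  // DXGI_ERROR_DEVICE_REMOVED
constexpr HResult kDeviceHung          = 0x887A0006u;  // DXGI_ERROR_DEVICE_HUNG
constexpr HResult kDeviceReset         = 0x887A0007u;  // DXGI_ERROR_DEVICE_RESET
constexpr HResult kDriverInternalError = 0x887A0020u;  // DXGI_ERROR_DRIVER_INTERNAL_ERROR

// Returns true for any of the four codes the commit message treats as a
// TDR / device-removal event, not just DEVICE_REMOVED.
constexpr bool IsDeviceLost(HResult hr) {
    return hr == kDeviceRemoved || hr == kDeviceHung ||
           hr == kDeviceReset || hr == kDriverInternalError;
}

// Builds a triage message that carries all three HRESULTs: the one returned by
// command-list execution and the removal reasons reported by the DML and D3D12
// devices. In the real code these would come from GetDeviceRemovedReason; here
// they are plain parameters.
std::string FormatDeviceLostMessage(HResult executeHr, HResult dmlReason, HResult d3d12Reason) {
    char buffer[256];
    std::snprintf(buffer, sizeof(buffer),
                  "GPU device lost (TDR / device removal). "
                  "Execute=0x%08X, DML reason=0x%08X, D3D12 reason=0x%08X",
                  static_cast<unsigned>(executeHr),
                  static_cast<unsigned>(dmlReason),
                  static_cast<unsigned>(d3d12Reason));
    return std::string(buffer);
}
```

A caller would check `IsDeviceLost(hr)` on the failed HRESULT and, if it returns true, raise an error carrying the formatted message. Any other failure would keep its original handling.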