llvm-project
84118865 - [mlir][SYCL] Fail init errors cleanly instead of `abort`ing (#192979)

Commit
14 days ago
[mlir][SYCL] Fail init errors cleanly instead of `abort`ing (#192979) Disclaimer: this is my first PR to LLVM. I'm trying to follow the contribution guide and the conventions i see in other PRs, but if i missed something -- please let me know. ## Summary Fixes #182807. When the SYCL runtime wrapper is loaded on a host without a Level-Zero backend, `getDefaultDevice()` throws an `std::runtime_error`, `catchAll` catches it, and calls `abort()`, which results in a "PLEASE submit a bug report" stack dump, which is not correct for this kind of crash. With this change `catchAll` now writes to stderr and terminates via `std::exit(EXIT_FAILURE)`, yielding a clean exit code 1 with no crash dump. The "getDefaultDevice failed" message is also replaced with a (hopefully) better one. ## Implementation notes The runners in `mlir/lib/ExecutionEngine/` don't have much consistency with regards as to how to output diagnostics. 2 files use `llvm::outs/errs`, 9 files use `std::cout/cerr`, 8 files use `fprintf`. I left `fprintf` in place, since it was used in that module before and it's also used by CUDA and ROCm runners (the most dominant ones in the folder). I'm open to changing it to something else if needed. ## Notes for reviewers 1. Why this fix strategy doesn't match the CUDA/ROCm wrappers' print-warning-and-continue style: The `print-and-continue` approach was tried and it produces worse behavior with SYCL wrapper. The entry points would return `nullptr` on init failure, but the MLIR JIT runner does not null-check those returns, so the `nullptr` is dereferenced inside a later SYCL call and produces another SIGSEGV and the same crash dump this patch is trying to avoid. Making that style work would require null-checks across the downstream wrappers. We can go that way if desired, but i opted to start with a simpler fix. 2. Why `L0_SAFE_CALL` (`SyclRuntimeWrappers.cpp`, line 40) is left untouched: it's a deliberate decision, as `L0_SAFE_CALL` is supposed to abort on Level-Zero errors (kernel launch, OOM, etc). I can reconsider if there are some arguments for updating error handling strategy in `L0_SAFE_CALL` too, I just don't see any yet. 3. There's no unit test with this change, as I wasn't able to find a test that would test this logic (i guess this kind of logic isn't usually tested with unit tests anyway). I did test it manually, though, for both positive and negative cases.
Author
Parents
Loading