[mlir][SYCL] Fail init errors cleanly instead of `abort`ing (#192979)
Disclaimer: this is my first PR to LLVM. I'm trying to follow the
contribution guide and the conventions i see in other PRs, but if i
missed something -- please let me know.
## Summary
Fixes #182807.
When the SYCL runtime wrapper is loaded on a host without a Level-Zero
backend, `getDefaultDevice()` throws an `std::runtime_error`, `catchAll`
catches it, and calls `abort()`, which results in a "PLEASE submit a bug
report" stack dump, which is not correct for this kind of crash.
With this change `catchAll` now writes to stderr and terminates via
`std::exit(EXIT_FAILURE)`, yielding a clean exit code 1 with no crash
dump. The "getDefaultDevice failed" message is also replaced with a
(hopefully) better one.
## Implementation notes
The runners in `mlir/lib/ExecutionEngine/` don't have much consistency
with regards as to how to output diagnostics. 2 files use
`llvm::outs/errs`, 9 files use `std::cout/cerr`, 8 files use `fprintf`.
I left `fprintf` in place, since it was used in that module before and
it's also used by CUDA and ROCm runners (the most dominant ones in the
folder). I'm open to changing it to something else if needed.
## Notes for reviewers
1. Why this fix strategy doesn't match the CUDA/ROCm wrappers'
print-warning-and-continue style: The `print-and-continue` approach was
tried and it produces worse behavior with SYCL wrapper. The entry points
would return `nullptr` on init failure, but the MLIR JIT runner does not
null-check those returns, so the `nullptr` is dereferenced inside a
later SYCL call and produces another SIGSEGV and the same crash dump
this patch is trying to avoid. Making that style work would require
null-checks across the downstream wrappers. We can go that way if
desired, but i opted to start with a simpler fix.
2. Why `L0_SAFE_CALL` (`SyclRuntimeWrappers.cpp`, line 40) is left
untouched: it's a deliberate decision, as `L0_SAFE_CALL` is supposed to
abort on Level-Zero errors (kernel launch, OOM, etc). I can reconsider
if there are some arguments for updating error handling strategy in
`L0_SAFE_CALL` too, I just don't see any yet.
3. There's no unit test with this change, as I wasn't able to find a
test that would test this logic (i guess this kind of logic isn't
usually tested with unit tests anyway). I did test it manually, though,
for both positive and negative cases.