[UR][L0v2] Fix sync bug in enqueueEventsWaitWithBarrier (#21251)
`ur_queue_immediate_out_of_order_t::enqueueEventsWaitWithBarrier` has a
copy-paste bug where it waits for barrier events `N` times on the first
(internal) command list, instead of waiting on the `N` command lists
once each.
The erroneous loop was likely copied from the preceding call to
`barrierFn`, and it was not caught in testing or code review.
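The indexing mistake described above can be illustrated with a minimal, self-contained sketch. This is not the adapter's code: `MockCommandList`, `appendWaitsBuggy`, and `appendWaitsFixed` are hypothetical names invented for illustration, and the mock only counts how many waits each command list receives.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a command list. It only counts how many
// "wait on barrier events" commands were appended to it.
struct MockCommandList {
  int waitsAppended = 0;
  void appendWaitOnEvents() { ++waitsAppended; }
};

// Buggy pattern: the loop runs once per command list, but every
// iteration targets list 0, so the other lists never wait.
void appendWaitsBuggy(std::vector<MockCommandList> &lists) {
  assert(!lists.empty());
  for (std::size_t i = 0; i < lists.size(); ++i)
    lists[0].appendWaitOnEvents();
}

// Intended pattern: each command list waits exactly once.
void appendWaitsFixed(std::vector<MockCommandList> &lists) {
  for (std::size_t i = 0; i < lists.size(); ++i)
    lists[i].appendWaitOnEvents();
}
```

With three mock lists, the buggy variant leaves lists 1 and 2 with no wait at all, which matches the symptom of work being dispatched ahead of the barrier.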
The bug does not seem to reproduce on any released GPU on Linux. It
looks as if waiting for any event on a single command list blocks
dispatch from every other command list on all our current GPUs. However,
I did not investigate this deeply, because I believe this is a clear
error on UR's side either way.
The bug IS reproducible on an Intel internal simulator; this is how I
caught it. I can provide more details on internal channels if desired.
For reference, the reproducer I used is below. I tested it on BMG and
Panther Lake, where it passes both before and after this PR, and on the
simulator, where it fails before this change and passes after it. The
reproducer also passes with the Level Zero V1 adapter on the simulated
device.
<details>
<summary>Reproducer</summary>
```cpp
#include <cstdlib>
#include <iostream>
#include <sycl/sycl.hpp>

int main(int argc, char *argv[]) {
  sycl::queue q; // Out of order!
  int tripCount = 200'000'000;
  if (argc > 1)
    tripCount = std::atoi(argv[1]);
  int *a = sycl::malloc_shared<int>(1, q);
  int *b = sycl::malloc_shared<int>(1, q);
  // Long-running kernel: gives the second kernel a chance to read *a too early.
  q.single_task([=] {
    float sum = 0;
    for (int i = 0; i < tripCount; ++i)
      sum += sycl::sqrt(float(i));
    *a = (sum > 0);
  });
  // The barrier must order the second kernel after the first.
  q.ext_oneapi_submit_barrier();
  q.single_task([=] { *b = *a + 1; });
  q.wait();
  std::cout << "a: " << *a << ", b: " << *b << std::endl;
  if (*a != 1 || *b != 2) {
    std::cout << "Test failed!" << std::endl;
    return 1;
  }
  std::cout << "Test passed!" << std::endl;
}
```
</details>
I am unsure how a reasonable test might be written to cover this; please
advise if one is desired.
Fixes: https://github.com/intel/llvm/issues/20861