[Clang] Update the 'gpuintrin.h' lane scan handling (#185451)
Summary:
This patch uses a more efficient algorithm for the reduction rather than
a divergent branch. We also provide a prefix and suffix version, the sum
is now just the first element of this.
This changes the name to this, which is technically breaking but I don't
think these were really used in practice and it's a trivial change based
on the clang version if it's really needed..
```
__gpu_prefix_scan_sum_u32(...)
__gpu_suffix_scan_sum_u32(...)
```