[CIR][CUDA][HIP] Support stream-per-thread kernel launch (#188004)
Related: #175871, #179278
When `-fgpu-default-stream=per-thread` is specified, CUDA and HIP
kernels should be launched using the per-thread stream variants of the
launch API instead of the default `cudaLaunchKernel`/`hipLaunchKernel`.
This PR implements that by selecting the correct launch function name in
`emitDeviceStubBodyNew`:
- For CUDA: `cudaLaunchKernel_ptsz`
- For HIP: `hipLaunchKernel_spt`
This matches the behavior of the original (OG) CodeGen implementation in
`CGCUDANV.cpp`.
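
For illustration, the name selection described above can be sketched as a
standalone function. This is a minimal sketch, not the code in this PR:
`DefaultStreamKind`, `launchKernelName`, and `checkLaunchKernelNames` are
hypothetical names, and the real logic lives in `emitDeviceStubBodyNew` and
reads the setting from the language options.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the value of -fgpu-default-stream.
enum class DefaultStreamKind { Legacy, PerThread };

// Returns the runtime launch function a device stub would call.
// Assumption: the per-thread variant is formed by appending a suffix
// to the default launch function name.
std::string launchKernelName(bool isHIP, DefaultStreamKind kind) {
  std::string name = isHIP ? "hipLaunchKernel" : "cudaLaunchKernel";
  if (kind == DefaultStreamKind::PerThread)
    name += isHIP ? "_spt" : "_ptsz";
  return name;
}

// Small self-check covering both per-thread names listed above.
void checkLaunchKernelNames() {
  assert(launchKernelName(false, DefaultStreamKind::PerThread) ==
         "cudaLaunchKernel_ptsz");
  assert(launchKernelName(true, DefaultStreamKind::PerThread) ==
         "hipLaunchKernel_spt");
}
```

Without the flag, the sketch falls back to `cudaLaunchKernel` or
`hipLaunchKernel`, so existing launches are unchanged.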