[flang][cuda][NFC] Use NVVM operation for thread syncs (#166999)
Use the operation introduced in #166698. Also split the test into a new
file since `flang/test/Lower/CUDA/cuda-device-proc.cuf` is getting too
big. I'm planning to reorganize this file to have better separation of
the tests.