[NVPTX] Support for memory orderings for cmpxchg (#126159)
So far, all cmpxchg instructions were lowered to atom.cas. This change
adds support for memory orders in lowering. Specifically:
- For cmpxchg which are emulated, memory ordering is enforced by adding
fences around the emulation loops.
- For cmpxchg which are lowered to PTX directly, where the memory order
is supported in ptx, lower directly to the correct ptx instruction.
- For seq_cst cmpxchg which are lowered to PTX directly, use a sequence
(fence.sc; atom.cas.acquire) to provide the semantics that we want.
Also adds tests for all possible combinations of (size, memory ordering,
address space, SM/PTX versions)
This also adds `atomicOperationOrderAfterFenceSplit` in TargetLowering,
for specially handling seq_cst atomics.