[CUDA] refactor in-header implementation of __ld*/__st* with different cache modes. (#190021)
* Generalized creation of the variant sets.
* Added implementations for the missing operation modes. Now we match
what's available in CUDA headers.
* Cleaned up discrepancies in `__asm__ __volatile__` use. It is needed
for some ops that warm up the cache and therefore must not be discarded
when the load result is unused.
Manually verified that clang's versions of these functions generate
exactly the same instructions nvcc generates from CUDA headers.
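
For illustration only, here is a minimal sketch of how a macro can generate a set of cache-hinted load wrappers and why `__volatile__` matters. This is not the code from this patch. `SKETCH_DEVICE`, `SKETCH_LOAD`, `sketch_ldcg` and `sketch_ldcg_keep` are hypothetical names, and pairing the `.cg` cache operator with `__volatile__` is only an example, not a statement about which ops the header marks volatile. The host fallback exists only so the sketch also compiles as plain C++.

```cpp
// Hypothetical sketch; names are illustrative and not taken from the patch.
#if defined(__CUDACC__)
#define SKETCH_DEVICE __device__ inline
#else
#define SKETCH_DEVICE inline // plain C++ build, so the sketch compiles on the host
#endif

#if defined(__CUDA_ARCH__)
// Device pass: emit the PTX load directly. VOLATILE is either empty or
// __volatile__. The "l" constraint assumes 64-bit pointers.
#define SKETCH_LOAD(FN, OP, TYPE, CONSTRAINT, VOLATILE)                 \
  SKETCH_DEVICE TYPE FN(const TYPE *ptr) {                              \
    TYPE ret;                                                           \
    __asm__ VOLATILE(OP " %0, [%1];" : CONSTRAINT(ret) : "l"(ptr));     \
    return ret;                                                         \
  }
#else
// Host pass or non-CUDA compiler: fall back to an ordinary load.
#define SKETCH_LOAD(FN, OP, TYPE, CONSTRAINT, VOLATILE)                 \
  SKETCH_DEVICE TYPE FN(const TYPE *ptr) { return *ptr; }
#endif

// Non-volatile variant: the compiler may drop the asm if the result is unused.
SKETCH_LOAD(sketch_ldcg, "ld.global.cg.s32", int, "=r", )
// Volatile variant: the load is kept even when the caller ignores the result.
SKETCH_LOAD(sketch_ldcg_keep, "ld.global.cg.s32", int, "=r", __volatile__)
```

In device code, a call such as `sketch_ldcg_keep(p);` that ignores the result still emits the load, whereas the optimizer is free to delete the non-volatile `sketch_ldcg(p);`.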