[pytorch/cuda] apply 16-bit mask to the index for device guard registry (#45485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45485
Essentially this is the problem reported by ezyang: https://fb.workplace.com/groups/llvm.gcc/permalink/4053565044692080. There are two proposed fixes:
* https://github.com/pytorch/pytorch/pull/44883: this doesn't work because it trips a static assert at build time
```
caffe2/c10/core/TensorOptions.h:553:1: error: static_assert failed due to requirement 'sizeof(c10::TensorOptions) <= sizeof(long) * 2' "TensorOptions must fit in 128-bits"
static_assert( sizeof(TensorOptions) <= sizeof(int64_t) * 2,
^
```
* https://github.com/pytorch/pytorch/pull/44885: to be tested
This diff is a temporary hack to work around the problem: mask the registry index down to 16 bits before using it. Without this patch, the cast of the device type produces a corrupted index:
```
volatile size_t device_type = static_cast<size_t>(type);
auto p = device_guard_impl_registry[device_type].load();
C10_LOG_FIRST_N(WARNING, 10) << "XDW-fail: " << cntr << ", Device type: " << type << ", type cast: " << device_type << ", guard: " << p;
// output
XDW-fail: 1129, Device type: cuda, type cast: 65537, guard: 0
```
Another workaround is D23788441, which changes -O3 to -O2, so this appears to be a miscompilation by nvcc or the host compiler.
Reviewed By: ezyang
Differential Revision: D23972356
fbshipit-source-id: ab91fbbfccb6389052de216f95cf9a8265445aea