bypass `getDeviceFromPtr` check when device is known (#36714)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36594
In some cases, using memory that was allocated in another process before performing any memory-related operation in PyTorch produces errors, because the CUDA context on the GPU is not yet fully initialized.
I assume there is a deliberate reason to leave the context uninitialized at first, rather than initializing it in `THCudaInit` where other CUDA calls already happen; I'd like to discuss that in this PR.
Possible alternative solutions:
- Initialize the device context in `fromDLPack` or `from_blob`, e.g. by creating a dummy one-element array. This feels like a hack.
- Catch the exception in `getDeviceFromPtr`, check whether the context was initialized, and retry the operation if not. However, this would require checking every device.
This PR bypasses the `getDeviceFromPtr` call, which is the one causing the problem, when the device is already known. This allows us to create the tensor from the shared-memory storage without initializing the context; the context will be initialized later, when the tensor is actually accessed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36714
Differential Revision: D21504557
Pulled By: ngimel
fbshipit-source-id: 173ccdeb7c2a2b0ece53dd50be97f2df577a5634