Synchronize MAGMA functions with the current CUDA stream (#36605)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21821
This follows ngimel's [suggestion](https://github.com/pytorch/pytorch/issues/21821#issuecomment-502968982) to manually synchronize MAGMA calls with the current stream. This is handled automatically with `MagmaStreamSyncGuard`.
For the functions with `_batched` variants, I think we could possibly avoid synchronization by calling the batched variant with a batch of size 1, since those take a `magma_queue_t` argument. However, I presume there's a reason it wasn't written like that in the first place.
I also figured out why porting to ATen ["magically fixed"](https://github.com/pytorch/pytorch/issues/21821#issuecomment-527647971) `torch.svd`. The MAGMA functions for svd all take host arrays as input and output. The ATen port uses blocking `copy_` calls, which fully synchronize the operation. The THC functions, on the other hand, use `cudaMemcpy`, which doesn't synchronize with streams created with the `cudaStreamNonBlocking` flag (which ATen uses). The fix is to use `cudaMemcpyAsync` followed by `cudaStreamSynchronize`, the same as `copy_` does internally:
https://github.com/pytorch/pytorch/blob/835ee34e38eed3f5b35726b40be9c48e75201618/aten/src/ATen/native/cuda/Copy.cu#L192-L193
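The before/after pattern can be sketched roughly as follows (a hedged fragment, not the literal THC diff; `host_dst`, `device_src`, and `nbytes` are placeholder names):

```cpp
// Before: cudaMemcpy is ordered with respect to the legacy default stream
// only, so it does NOT wait for work queued on a stream created with the
// cudaStreamNonBlocking flag. The host can observe stale results.
//   cudaMemcpy(host_dst, device_src, nbytes, cudaMemcpyDeviceToHost);

// After: enqueue the copy on the current stream, then block the host until
// that stream has drained, matching what copy_ does internally.
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
cudaMemcpyAsync(host_dst, device_src, nbytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```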
I'm not sure how to test these changes, as I wasn't able to reproduce any of the stream sync issues. That is possibly a mixture of non-determinism and the fact that some of these functions are implicitly synchronous anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36605
Differential Revision: D21258265
Pulled By: ngimel
fbshipit-source-id: 76d8f687c605e5e9cd68b97dc1d70a39a13376ec