Fix test_inverse_singular for cublas path; fix cusolver inverse multi-stream issue (#47026)
Summary:
### test_inverse_singular for cublas failure
Related
https://github.com/pytorch/pytorch/pull/46616#issuecomment-718102758
https://app.circleci.com/pipelines/github/pytorch/pytorch/232112/workflows/4131d4ca-cd51-44e3-8e6c-b1c3555c62fa/jobs/8523970/tests
The CUDA 11.1 CI container doesn't have the MAGMA library, so the cublas matrix inverse path is enabled.
```
Oct 27 23:13:47 -- MAGMA not found. Compiling without MAGMA support
```
test_inverse_singular was introduced in https://github.com/pytorch/pytorch/pull/46625, but I forgot to apply the same fix to the cublas path. This PR covers that path as well.
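For reference, the behavior the test expects can be sketched roughly as follows. This is a minimal illustration, not the actual test from the PyTorch test suite; the helper name `inverse_rejects_singular` and its parameters are hypothetical. It assumes singular input is reported as a `RuntimeError` (or a subclass), without asserting on the exact message.

```python
import torch


def inverse_rejects_singular(device="cpu", batch=3, bad_index=1):
    """Return True if torch.inverse raises for a batch holding one singular matrix."""
    # Identity matrices are invertible; zeroing one diagonal entry makes
    # exactly one matrix in the batch singular.
    x = torch.eye(3, device=device).repeat(batch, 1, 1)
    x[bad_index, -1, -1] = 0
    try:
        torch.inverse(x)
    except RuntimeError:
        # Singular input surfaces as a RuntimeError or a subclass of it.
        return True
    return False
```

Passing `device="cuda"` exercises the GPU path; which backend (MAGMA, cublas, cusolver) ends up being used depends on how PyTorch was built and on the input shape.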
### cusolver inverse multi-stream failure
Fixes https://github.com/pytorch/pytorch/issues/47272
The original CUDA event record / stream blocking order was wrong, which could cause NaN in the output tensor.
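As an illustration of the ordering that event-based synchronization needs, here is a Python-level sketch using the public `torch.cuda` stream and event API. It is not the C++ code touched by this change; the function name `matmul_on_side_stream` is hypothetical, and the matrix product is a stand-in for the per-matrix work.

```python
import torch


def matmul_on_side_stream(x):
    """Compute x @ x on a side CUDA stream with explicit event ordering."""
    if not x.is_cuda:
        # No CUDA streams on CPU; plain call so the sketch runs anywhere.
        return x @ x

    main = torch.cuda.current_stream(x.device)
    side = torch.cuda.Stream(device=x.device)

    # 1. Record on the stream that produced x, then make the side stream
    #    wait on that event before it touches x.
    ready = torch.cuda.Event()
    ready.record(main)
    side.wait_event(ready)

    with torch.cuda.stream(side):
        y = x @ x

    # 2. Record on the side stream after its work is queued, then make the
    #    main stream wait before anything reads y.
    done = torch.cuda.Event()
    done.record(side)
    main.wait_event(done)

    # Tell the caching allocator about the cross-stream uses of x and y.
    x.record_stream(side)
    y.record_stream(main)
    return y
```

The point of the sketch is that each event is recorded on the stream whose work must finish first, and waited on by the stream that consumes the result. Recording on the wrong stream, or before the work is queued, lets the consumer read data that is not ready yet.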
On my machine, the original implementation produced NaN within about 50k to 500k loop iterations. After this change, no NaN was observed in more than 2.5M iterations.
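A stress loop of this kind can be sketched as below. This is a hypothetical reproduction harness, not the script from the linked issue; the name `count_nan_inverses`, the matrix size, and the seed are assumptions. The input is strictly diagonally dominant, so a correct inverse never contains NaN.

```python
import torch


def count_nan_inverses(iterations, device="cpu", n=4, seed=0):
    """Invert a batch of two well-conditioned matrices repeatedly; count NaN outputs."""
    gen = torch.Generator().manual_seed(seed)
    # Diagonal entries are at least 2 and off-diagonal entries are below 0.1,
    # so each matrix is strictly diagonally dominant and invertible.
    noise = 0.1 * torch.rand(2, n, n, generator=gen)
    x = (2 * torch.eye(n) + noise).to(device)
    bad = 0
    for _ in range(iterations):
        # The NaN check copies a flag to the host, which synchronizes each iteration.
        if torch.isnan(torch.inverse(x)).any().item():
            bad += 1
    return bad
```

On a CUDA build one would call it with `device="cuda"` and a large iteration count; any nonzero return value indicates a corrupted result.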
The performance of matrix inverse for batch size 2 is unchanged from the numbers reported in https://github.com/pytorch/pytorch/issues/42403.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47026
Reviewed By: mruberry
Differential Revision: D24838546
Pulled By: ngimel
fbshipit-source-id: 3b83e4ab8e6b47a8273cba277251765bd6d97911