Enable the faster cuBLAS path in torch.linalg.lstsq for batches of small matrices
This PR enables the cuBLAS path for `torch.linalg.lstsq`. Before this PR, only the cuSOLVER path was used for regular PyTorch builds (when built with MAGMA).
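For context, a minimal sketch of the kind of call this change targets: a batched least-squares solve over many small matrices. The shapes, the number of right-hand sides, and the CPU fallback are illustrative assumptions, not taken from the PR. Backend selection happens inside PyTorch and is not controlled by anything in this snippet.

```python
import torch

# Use the GPU when available; the cuBLAS path is only relevant for CUDA tensors.
# The CPU fallback is here only so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A batch of 64 small 32x32 systems (same shape as one row of the benchmark table).
A = torch.randn(64, 32, 32, device=device)
# Four right-hand sides per system (an arbitrary, illustrative choice).
B = torch.randn(64, 32, 4, device=device)

# Solve min_X ||A X - B|| independently for every matrix in the batch.
result = torch.linalg.lstsq(A, B)
X = result.solution

print(X.shape)
```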
Performance results (also previously reported at https://github.com/pytorch/pytorch/pull/54725#issuecomment-832234456):
```
| input shape                | before current PR | current PR | speedup |
|----------------------------|-------------------|------------|---------|
| torch.Size([32, 32, 32])   | 870               | 440        | 2x      |
| torch.Size([64, 32, 32])   | 1340              | 450        | 3x      |
| torch.Size([32, 64, 64])   | 9040              | 1839       | 5x      |
| torch.Size([64, 64, 64])   | 17000             | 1830       | 9.2x    |
| torch.Size([32, 128, 128]) | 23210             | 8560       | 2.7x    |
| torch.Size([64, 128, 128]) | 40000             | 8662       | 4.6x    |
| torch.Size([32, 256, 256]) | 58160             | 46150      | 1.2x    |
| torch.Size([64, 256, 256]) | 73220             | 52080      | 1.4x    |
Times are in microseconds (us).
```
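The speedup column is the ratio of the "before" time to the "current PR" time. A small sketch that recomputes it from the timings in the table above; the dictionary layout and the `speedup` helper are illustrative, not part of the PR. The exact ratios can differ slightly from the table, whose speedup values appear to be rounded or truncated.

```python
# Input shape -> (time before this PR, time with this PR), in microseconds,
# copied from the benchmark table.
timings_us = {
    (32, 32, 32): (870, 440),
    (64, 32, 32): (1340, 450),
    (32, 64, 64): (9040, 1839),
    (64, 64, 64): (17000, 1830),
    (32, 128, 128): (23210, 8560),
    (64, 128, 128): (40000, 8662),
    (32, 256, 256): (58160, 46150),
    (64, 256, 256): (73220, 52080),
}


def speedup(before_us, after_us):
    """Return how many times faster the new timing is than the old one."""
    return before_us / after_us


for shape, (before_us, after_us) in timings_us.items():
    print(f"{shape}: {speedup(before_us, after_us):.2f}x")
```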
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74434
Approved by: https://github.com/mruberry