Added cuBLAS path for torch.linalg.lstsq (#54725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54725
cuBLAS's gelsBatched is faster than MAGMA's for matrices with fewer than
128 rows.
Performance comparison cuSOLVER vs cuBLAS: https://github.com/pytorch/pytorch/pull/54725#issuecomment-832234456.
Performance comparison MAGMA vs cuBLAS: https://github.com/pytorch/pytorch/pull/54725#issuecomment-827649039.
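
As a rough illustration of how a row-count cutoff like the one described above could steer backend selection, here is a minimal sketch. It is not the dispatch code from this PR: the function name `pick_lstsq_backend`, the constant `CUBLAS_MAX_ROWS`, and the backend labels are all hypothetical, and the cutoff is taken directly from the 128-row observation rather than from the actual implementation.

```python
# Hypothetical sketch only: this is NOT PyTorch's actual dispatch logic.
# The cutoff is assumed from the observation that cuBLAS's gelsBatched
# outperforms MAGMA for matrices with fewer than 128 rows.
CUBLAS_MAX_ROWS = 128


def pick_lstsq_backend(rows: int) -> str:
    """Return the label of the batched least-squares backend to prefer."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    # Small matrices go to cuBLAS; everything else stays on MAGMA.
    return "cublas" if rows < CUBLAS_MAX_ROWS else "magma"


if __name__ == "__main__":
    for rows in (16, 127, 128, 1024):
        print(rows, pick_lstsq_backend(rows))
```

The real selection may also depend on batch size and on the cuSOLVER comparison linked above, so treat the single threshold here as a simplification.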
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D28248803
Pulled By: mruberry
fbshipit-source-id: d3661bccb85c6fc1cee3a246ae8233492964f400