Use at::parallel on lu_factor (#93037)
https://github.com/pytorch/pytorch/issues/91536 reports that `torch.inv` is quite slow for large batches of small matrices on CUDA.
I checked the CPU implementation and found an optimization opportunity.
For `torch.inv`, the CPU path solves the problem via `lu_factor` + `lu_solve`.
`lu_factor` loops over the `batch_size` dimension, and the parallelism happens inside LAPACK:
- For small matrices, the computation is too tiny to be worth parallelizing inside LAPACK.
- Even for large matrices, LAPACK's parallelization efficiency is poor (it performs worse than using `at::parallel` outside).
- Only for a small batch size combined with a small matrix size does the OpenMP overhead of parallelizing outside outweigh the benefit (see the 100/4 column below).
Based on the above observations, using `at::parallel` outside `lu_factor` gives a substantial benefit.
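For illustration, here is a minimal sketch of the pattern (not the PR's actual diff), assuming a row-major buffer of `batch` independent `n x n` matrices: `at::parallel_for` splits the batch range across ATen's intra-op thread pool, and each worker factors whole matrices sequentially. `tiny_getrf` is a hypothetical stand-in for the single-threaded LAPACK `?getrf` call the real code dispatches to.

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a sequential LAPACK ?getrf call: unblocked
// in-place LU with partial pivoting on one n x n row-major matrix.
// Assumes the matrix is non-singular (no error reporting, for brevity).
static void tiny_getrf(double* a, int64_t* piv, int64_t n) {
  for (int64_t k = 0; k < n; ++k) {
    int64_t p = k;  // pick the largest pivot in column k
    for (int64_t i = k + 1; i < n; ++i)
      if (std::abs(a[i * n + k]) > std::abs(a[p * n + k])) p = i;
    piv[k] = p;
    if (p != k)
      for (int64_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[p * n + j]);
    for (int64_t i = k + 1; i < n; ++i) {
      a[i * n + k] /= a[k * n + k];
      for (int64_t j = k + 1; j < n; ++j)
        a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
  }
}

// Factor `batch` matrices, parallelizing over the batch dimension instead
// of inside each individual factorization.
void lu_factor_batched(double* a, int64_t* pivots, int64_t batch, int64_t n) {
  at::parallel_for(0, batch, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    // Each worker owns the disjoint slice [begin, end), so no locking is
    // needed: every matrix is factored independently.
    for (int64_t i = begin; i < end; ++i)
      tiny_getrf(a + i * n * n, pivots + i * n, n);
  });
}
```

With a grain size of 1, ATen is free to balance the `[0, batch)` range across threads, and the thread-pool dispatch cost is paid once per chunk rather than once per matrix, which is why the overhead only dominates in the smallest 100/4 case below.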
Here is the benchmark code and the data collected on a 32-core ICX system; each timing below is the total seconds for 100 iterations.
```python
import torch
import time

def bench(bs, r):
    # Factor a batch of `bs` random r x r matrices, 100 times.
    x = torch.randn(int(bs), r, r)
    start = time.time()
    for i in range(100):
        y1 = torch.linalg.lu_factor(x)
    end = time.time()
    print(r, bs)
    print(end - start)               # total seconds for 100 iterations
    print((end - start) / (r ** 3))  # normalized by the per-matrix O(r^3) cost

for r in (4, 16, 64):
    for bs in (1e2, 1e4, 1e6):
        bench(bs, r)
```

| bs / r (batch / matrix size) | 100/4 | 10000/4 | 1000000/4 | 100/16 | 10000/16 | 1000000/16 | 100/64 | 10000/64 | 1000000/64 |
| ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| parallel inside LAPACK (s) | 0.0028 | 1.077 | 11.99 | 0.0163 | 1.5260 | 153.17 | 0.2021 | 20.93 | 1877 |
| parallel outside LAPACK (s) | 0.0087 | 0.0247 | 1.566 | 0.0044 | 0.1678 | 17.63 | 0.038 | 2.311 | 208.6 |
| speedup | 0.32x | 43.6x | 7.65x | 3.70x | 9.09x | 8.69x | 5.32x | 9.06x | 9x |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93037
Approved by: https://github.com/lezcano