Use at::parallel on lu_factor (#93037)
https://github.com/pytorch/pytorch/issues/91536 reports that `torch.inv` is quite slow for large batches of small matrices on CUDA.
I checked the CPU implementation and found an optimization opportunity.
For `torch.inv`, the CPU path solves the problem via `lu_factor` + `lu_solve`.
`lu_factor` loops over the `batch_size` dimension, and the parallelism happens inside LAPACK:
- For small matrices, the computation is too tiny to be worth parallelizing inside LAPACK.
- Even for large matrices, LAPACK's parallelization efficiency is poor (it performs worse than using `at::parallel` outside).
- Only for a small batch size combined with a small matrix size does the OpenMP overhead of parallelizing outside outweigh the benefit (see the 100/4 column below).
Based on the above observations, using `at::parallel` outside `lu_factor` gives a substantial benefit.
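For illustration, here is a minimal sketch of the pattern (not the PR's actual diff), assuming a row-major buffer of `batch` independent `n x n` matrices: `at::parallel_for` splits the batch range across ATen's intra-op thread pool, and each worker factors whole matrices sequentially. `tiny_getrf` is a hypothetical stand-in for the single-threaded LAPACK `?getrf` call the real code dispatches to.

```cpp
#include <ATen/Parallel.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a sequential LAPACK ?getrf call: unblocked
// in-place LU with partial pivoting on one n x n row-major matrix.
// Assumes the matrix is non-singular (no error reporting, for brevity).
static void tiny_getrf(double* a, int64_t* piv, int64_t n) {
  for (int64_t k = 0; k < n; ++k) {
    int64_t p = k;  // pick the largest pivot in column k
    for (int64_t i = k + 1; i < n; ++i)
      if (std::abs(a[i * n + k]) > std::abs(a[p * n + k])) p = i;
    piv[k] = p;
    if (p != k)
      for (int64_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[p * n + j]);
    for (int64_t i = k + 1; i < n; ++i) {
      a[i * n + k] /= a[k * n + k];
      for (int64_t j = k + 1; j < n; ++j)
        a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
  }
}

// Factor `batch` matrices, parallelizing over the batch dimension instead
// of inside each individual factorization.
void lu_factor_batched(double* a, int64_t* pivots, int64_t batch, int64_t n) {
  at::parallel_for(0, batch, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    // Each worker owns the disjoint slice [begin, end), so no locking is
    // needed: every matrix is factored independently.
    for (int64_t i = begin; i < end; ++i)
      tiny_getrf(a + i * n * n, pivots + i * n, n);
  });
}
```

With a grain size of 1, ATen is free to balance the `[0, batch)` range across threads, and the thread-pool dispatch cost is paid once per chunk rather than once per matrix, which is why the overhead only dominates in the smallest 100/4 case below.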
Here is the benchmark code and the data collected on a 32-core ICX system; each timing below is the total seconds for 100 iterations.
```python
import torch
import time

def bench(bs, r):
    # Factor a batch of `bs` random r x r matrices, 100 times.
    x = torch.randn(int(bs), r, r)
    start = time.time()
    for i in range(100):
        y1 = torch.linalg.lu_factor(x)
    end = time.time()
    print(r, bs)
    print(end - start)               # total seconds for 100 iterations
    print((end - start) / (r ** 3))  # normalized by the per-matrix O(r^3) cost

for r in (4, 16, 64):
    for bs in (1e2, 1e4, 1e6):
        bench(bs, r)
```

| bs / r (batch / matrix size) | 100/4 | 10000/4 | 1000000/4 | 100/16 | 10000/16 | 1000000/16 | 100/64 | 10000/64 | 1000000/64 |
| ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| parallel inside LAPACK (s) | 0.0028 | 1.077 | 11.99 | 0.0163 | 1.5260 | 153.17 | 0.2021 | 20.93 | 1877 |
| parallel outside LAPACK (s) | 0.0087 | 0.0247 | 1.566 | 0.0044 | 0.1678 | 17.63 | 0.038 | 2.311 | 208.6 |
| speedup | 0.32x | 43.6x | 7.65x | 3.70x | 9.09x | 8.69x | 5.32x | 9.06x | 9x |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93037
Approved by: https://github.com/lezcano