Improve performance for unary kernels using vml (#91963)
This gives some speedups for kernels implemented with `at::vml`:
- Make vml ops serial and use `TensorIterator.for_each` for better parallelism
with discontiguous tensors
- Reduce the buffer size for discontiguous data to 8 KiB, which is more likely
to fit in the L1d cache while remaining wide enough to utilize AVX-512
- Avoid a copy when only one of the input and output is discontiguous
There is no change for contiguous tensors, but I see a significant speedup on
the following benchmarks:
```
import torch
a = torch.randn(2*10**6, device="cpu")
%timeit a.view(100, 20000)[:,::2].sqrt()
%timeit a.view(200, 10000)[::2].sqrt()
```
For a discontiguous last dimension I see a 27x speedup, and for a discontiguous
batch dimension an 8x speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91963
Approved by: https://github.com/jgong5