Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to improve TorchInductor's vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example: with `config.cpp.simdlen` set to 8, the generated code is as follows. The main loop processes 8 floats per iteration with `aten::vec`, while a scalar OMP SIMD loop handles the remaining tail elements.
```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}
```
The major pipeline is as follows.
- Check whether the loop body can be vectorized with `aten::vec`. The checker consists of two parts. [One](https://github.com/pytorch/pytorch/blob/bf66991fc4860724368c5289d3db81de591b4cb2/torch/_inductor/codegen/cpp.py#L702) checks whether all the `ops` are supported. The [other](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L672) checks whether the data access can be vectorized.
- [`CppSimdVecKernelChecker`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L655)
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel serves as the tail loop when the main loop is vectorized.
- [`CppSimdVecKernel`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L601)
  - [`CppSimdVecOverrides`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L159): the ops supported on top of `aten::vec`
- Create kernel
- [`aten::vec` kernel](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L924)
- [`Original CPP kernel - OMP SIMD`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L929)
- Generate code
  - [`CppKernelProxy`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L753) is used to combine the `aten::vec` kernel and the original CPP kernel
    - [Vectorize the innermost loop](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L753)
- [Generate code](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L821)
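The two-part check in the pipeline above can be sketched in Python. The names below are illustrative only, not the actual TorchInductor internals: the real checker walks the loop body's ops and access patterns inside `CppSimdVecKernelChecker`.

```python
# Hypothetical sketch of the two-part vectorizability check.
# Part 1: every op in the loop body needs an aten::vec counterpart.
# Part 2: data accesses must be contiguous (stride 1) or broadcast
# (stride 0) so that loadu/store can be emitted.

SUPPORTED_VEC_OPS = {"load", "store", "constant", "add", "sub", "mul", "exp"}

def all_ops_supported(ops):
    """Part 1: check that every op has an aten::vec lowering."""
    return all(op in SUPPORTED_VEC_OPS for op in ops)

def access_is_vectorizable(strides):
    """Part 2: check that innermost-dimension strides allow vector loads."""
    return all(s in (0, 1) for s in strides)

def can_vectorize(ops, strides):
    """Vectorize only when both checks pass; otherwise fall back to OMP SIMD."""
    return all_ops_supported(ops) and access_is_vectorizable(strides)
```

For `res = torch.exp(torch.add(x, y))` with contiguous inputs, both checks pass, so the `aten::vec` kernel is generated; an unsupported op or a strided access would fall back to the scalar kernel.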
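The main-loop/tail-loop structure that `CppKernelProxy` stitches together can be emulated with a small Python sketch (hypothetical helper names; the real combination happens during C++ code generation):

```python
import math

def split_loop(n, simd_width=8):
    """Split a loop of length n into a vectorized main loop and a scalar tail.

    Mirrors the generated C++: the main loop runs n // simd_width times,
    and the tail starts at simd_width * (n // simd_width).
    """
    main_iters = n // simd_width
    tail_start = simd_width * main_iters
    return main_iters, tail_start

def run_kernel(x, y, simd_width=8):
    """Emulate res = exp(x + y) with a main loop plus scalar tail."""
    n = len(x)
    out = [0.0] * n
    main_iters, tail_start = split_loop(n, simd_width)
    for i0 in range(main_iters):           # aten::vec main loop
        base = simd_width * i0
        for lane in range(simd_width):     # one Vectorized<float> operation
            out[base + lane] = math.exp(x[base + lane] + y[base + lane])
    for i0 in range(tail_start, n):        # scalar OMP SIMD tail loop
        out[i0] = math.exp(x[i0] + y[i0])
    return out
```

With `n = 10` and `simd_width = 8`, the main loop covers indices 0-7 and the tail handles indices 8 and 9, matching the loop bounds in the generated kernel above.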
Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`
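The last item can be illustrated with a strength-reduction sketch (hypothetical Python, not Inductor code): rather than computing each vector base index with a multiply, and the trip count with a `div`, the index can be carried forward with an `add`.

```python
def indices_with_mul(n, w=8):
    """Base indices computed with a multiply per iteration (trip count via div)."""
    return [w * i for i in range(n // w)]

def indices_with_add(n, w=8):
    """Same indices, but the index is advanced with an add each iteration."""
    idx, out = 0, []
    while idx + w <= n:
        out.append(idx)
        idx += w  # strength-reduced: add replaces mul/div
    return out
```

Both produce identical base indices; the add-based form avoids the per-iteration multiply and the division in the loop bound.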
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel