Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to improve TorchInductor's vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example: with `config.cpp.simdlen` set to 8, the generated code is as follows. The main loop processes 8 floats per iteration with `aten::vec`, while a scalar OMP SIMD loop handles the remaining tail elements.
```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}
```
The major pipeline is as follows.
- Check whether the loop body can be vectorized with `aten::vec`. The checker consists of two parts. [One](https://github.com/pytorch/pytorch/blob/bf66991fc4860724368c5289d3db81de591b4cb2/torch/_inductor/codegen/cpp.py#L702) checks whether all the `ops` are supported. The [other](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L672) checks whether the data access can be vectorized.
- [`CppSimdVecKernelChecker`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L655)
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel serves as the tail loop when the main loop is vectorized.
- [`CppSimdVecKernel`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L601)
  - [`CppSimdVecOverrides`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L159): the ops supported on top of `aten::vec`
- Create kernel
- [`aten::vec` kernel](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L924)
- [`Original CPP kernel - OMP SIMD`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L929)
- Generate code
  - [`CppKernelProxy`](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L753) is used to combine the `aten::vec` kernel and the original CPP kernel
    - [Vectorize the innermost loop](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L753)
- [Generate code](https://github.com/pytorch/pytorch/blob/355326faa35405565ddb6ff8a2a945c7fce83db8/torch/_inductor/codegen/cpp.py#L821)
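The two-part check in the pipeline above can be sketched in Python. The names below are illustrative only, not the actual TorchInductor internals: the real checker walks the loop body's ops and access patterns inside `CppSimdVecKernelChecker`.

```python
# Hypothetical sketch of the two-part vectorizability check.
# Part 1: every op in the loop body needs an aten::vec counterpart.
# Part 2: data accesses must be contiguous (stride 1) or broadcast
# (stride 0) so that loadu/store can be emitted.

SUPPORTED_VEC_OPS = {"load", "store", "constant", "add", "sub", "mul", "exp"}

def all_ops_supported(ops):
    """Part 1: check that every op has an aten::vec lowering."""
    return all(op in SUPPORTED_VEC_OPS for op in ops)

def access_is_vectorizable(strides):
    """Part 2: check that innermost-dimension strides allow vector loads."""
    return all(s in (0, 1) for s in strides)

def can_vectorize(ops, strides):
    """Vectorize only when both checks pass; otherwise fall back to OMP SIMD."""
    return all_ops_supported(ops) and access_is_vectorizable(strides)
```

For `res = torch.exp(torch.add(x, y))` with contiguous inputs, both checks pass, so the `aten::vec` kernel is generated; an unsupported op or a strided access would fall back to the scalar kernel.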
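The main-loop/tail-loop structure that `CppKernelProxy` stitches together can be emulated with a small Python sketch (hypothetical helper names; the real combination happens during C++ code generation):

```python
import math

def split_loop(n, simd_width=8):
    """Split a loop of length n into a vectorized main loop and a scalar tail.

    Mirrors the generated C++: the main loop runs n // simd_width times,
    and the tail starts at simd_width * (n // simd_width).
    """
    main_iters = n // simd_width
    tail_start = simd_width * main_iters
    return main_iters, tail_start

def run_kernel(x, y, simd_width=8):
    """Emulate res = exp(x + y) with a main loop plus scalar tail."""
    n = len(x)
    out = [0.0] * n
    main_iters, tail_start = split_loop(n, simd_width)
    for i0 in range(main_iters):           # aten::vec main loop
        base = simd_width * i0
        for lane in range(simd_width):     # one Vectorized<float> operation
            out[base + lane] = math.exp(x[base + lane] + y[base + lane])
    for i0 in range(tail_start, n):        # scalar OMP SIMD tail loop
        out[i0] = math.exp(x[i0] + y[i0])
    return out
```

With `n = 10` and `simd_width = 8`, the main loop covers indices 0-7 and the tail handles indices 8 and 9, matching the loop bounds in the generated kernel above.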
Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`
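The last item can be illustrated with a strength-reduction sketch (hypothetical Python, not Inductor code): rather than computing each vector base index with a multiply, and the trip count with a `div`, the index can be carried forward with an `add`.

```python
def indices_with_mul(n, w=8):
    """Base indices computed with a multiply per iteration (trip count via div)."""
    return [w * i for i in range(n // w)]

def indices_with_add(n, w=8):
    """Same indices, but the index is advanced with an add each iteration."""
    idx, out = 0, []
    while idx + w <= n:
        out.append(idx)
        idx += w  # strength-reduced: add replaces mul/div
    return out
```

Both produce identical base indices; the add-based form avoids the per-iteration multiply and the division in the loop bound.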
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel