dcc159d3 - inductor: pre-convert a TensorBox's layout to FixedLayout on the FX side if one of its users is a CPU external custom kernel (#95873)

Given the following case:

```
import torch
import torch._dynamo


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.conv2 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.silu = torch.nn.SiLU(inplace=False)

    def forward(self, x):
        x = self.silu(x)
        y1 = self.conv1(x)
        y2 = self.conv2(x)
        return y1, y2


model = Model().eval()
model = model.to(memory_format=torch.channels_last).eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(128, 64, 112, 112).to(memory_format=torch.channels_last)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)
```

the `silu` output is consumed by two external kernels, and there is always a redundant memory copy:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
                tmp2.store(out_ptr1 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
                out_ptr1[i0] = tmp2;
            }
        }
    }
}
''')
```

This PR pre-converts the `silu` output's layout to FixedLayout on the FX side (the node is realized there, which avoids realizing it multiple times at the external kernels) when one of its users is a CPU external custom kernel. After this PR, the output code is:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')
```

Currently this PR only handles CPU external custom kernels; other external kernels may have the same issue.

For the TIMM model **eca_halonext26ts**, this PR gives about an **8%** performance improvement (BS=128, 20 cores on SKX).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95873
Approved by: https://github.com/jansel
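For reference, a minimal timing harness along the lines of the reproducer above can be used to check the reported speedup; this is only a sketch, and the warm-up count, iteration count, and wall-clock averaging here are assumptions rather than the methodology used in the PR:

```
import time
import torch
import torch._dynamo


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.conv2 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.silu = torch.nn.SiLU(inplace=False)

    def forward(self, x):
        # single producer feeding two convolution (external) kernels
        x = self.silu(x)
        return self.conv1(x), self.conv2(x)


model = Model().to(memory_format=torch.channels_last).eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(128, 64, 112, 112).to(memory_format=torch.channels_last)

with torch.no_grad():
    # warm-up runs to trigger compilation (assumed: 3 iterations)
    for _ in range(3):
        opt_model(x)
    # average wall-clock latency over a fixed number of runs (assumed: 10)
    start = time.perf_counter()
    for _ in range(10):
        opt_model(x)
    avg_s = (time.perf_counter() - start) / 10

print(f"avg latency: {avg_s * 1e3:.2f} ms")
```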