inductor: pre-convert a TensorBox's layout to FixedLayout at FX side if one user of it is a CPU external customer kernel (#95873)
Given the following case:
```
import torch
import torch._dynamo


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.conv2 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.silu = torch.nn.SiLU(inplace=False)

    def forward(self, x):
        # The silu output feeds two convolutions, i.e. two external kernels.
        x = self.silu(x)
        y1 = self.conv1(x)
        y2 = self.conv2(x)
        return y1, y2


model = Model().eval()
model = model.to(memory_format=torch.channels_last).eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(128, 64, 112, 112).to(memory_format=torch.channels_last)
with torch.no_grad():
    for i in range(3):
        out = opt_model(x)
```
The `silu` output is used by two external kernels, and the generated code always contains a redundant memory copy (the same result is stored to both `out_ptr0` and `out_ptr1`):
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
                tmp2.store(out_ptr1 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
                out_ptr1[i0] = tmp2;
            }
        }
    }
}
''')
```
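To get a rough sense of what the duplicated store costs, the size of the second output buffer can be worked out from the example input shape. This is only back-of-the-envelope arithmetic, assuming contiguous float32 storage (4 bytes per element); it is not a measurement.

```python
# Shape of the example input (N, C, H, W); silu preserves the shape.
n, c, h, w = 128, 64, 112, 112

numel = n * c * h * w        # elements written per output buffer
vec_iters = numel // 16      # iterations of the vectorized loop (16 floats each)
extra_bytes = numel * 4      # assumption: float32, 4 bytes per element

print(numel)                 # matches the scalar tail-loop bound in the kernel
print(vec_iters)             # matches the vectorized loop bound in the kernel
print(extra_bytes / 2**20)   # size of the redundant buffer in MiB
```

The element count and iteration count line up with the loop bounds in the generated kernel above, so the second store writes one extra full-size copy of the `silu` output.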
This PR pre-converts the layout of `silu`'s TensorBox to FixedLayout on the FX side if at least one of its users is a CPU external customer kernel. The buffer is then realized once up front, which avoids realizing it multiple times at the external kernels. After this PR, the output code is:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')
```
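The pre-conversion rule described above can be illustrated with a small toy model. This is a sketch only and not Inductor's actual implementation: the `Node` class, the `EXTERNAL_CPU_OPS` set, and the string layout tags are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical: ops assumed to be lowered to external (library) kernels on CPU.
EXTERNAL_CPU_OPS = {"conv2d"}


@dataclass
class Node:
    """Toy graph node; not torch.fx.Node or any Inductor IR class."""
    name: str
    op: str
    device: str = "cpu"
    layout: str = "FlexibleLayout"
    users: list = field(default_factory=list)


def has_external_cpu_user(node):
    # True if at least one consumer is a CPU external kernel.
    return any(u.device == "cpu" and u.op in EXTERNAL_CPU_OPS for u in node.users)


def preconvert_layouts(nodes):
    # Fix the layout of every producer that feeds a CPU external kernel,
    # so it is realized once instead of once per external consumer.
    fixed = []
    for node in nodes:
        if node.layout == "FlexibleLayout" and has_external_cpu_user(node):
            node.layout = "FixedLayout"
            fixed.append(node.name)
    return fixed


# Mirror the example model: silu feeds conv1 and conv2.
silu = Node("silu", "silu")
conv1 = Node("conv1", "conv2d")
conv2 = Node("conv2", "conv2d")
silu.users.extend([conv1, conv2])

fixed = preconvert_layouts([silu, conv1, conv2])
print(fixed)
```

In this toy graph only `silu` has its layout fixed, because it is the only node with an external-kernel user; the two convolutions have no users and are left untouched.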
Currently, this PR only considers CPU external customer kernels, but other external kernels may have the same issue.
For the Timm model **eca_halonext26ts**, this PR gives about an **8%** performance improvement (BS=128, 20 cores on SKX).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95873
Approved by: https://github.com/jansel