[TensorExpr] more convenient outer Rfactor output (#40050)
Summary:
Auto fuse the output loops of outer Rfactors, so it is in a more convenient format for binding GPU axes.
An example:
```
Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
LoopNest loop({c});
std::vector<For*> loops = loop.getLoopStmtsFor(c);
auto v = loops.at(0)->var();
loop.rfactor(c->body(), v);
```
Before:
```
{
Allocate(tmp_buf, float, {m});
sum[0] = 0.f;
for (int m_1 = 0; m_1 < m; m_1++) {
tmp_buf[m_1] = 0.f;
}
for (int m_1 = 0; m_1 < m; m_1++) {
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
}
}
}
for (int m_1 = 0; m_1 < m; m_1++) {
sum[0] = (sum[0]) + (tmp_buf[m_1]);
}
Free(tmp_buf);
}
```
After:
```
{
sum[0] = 0.f;
for (int m = 0; m < m_1; m++) {
Allocate(tmp_buf, float, {m_1});
tmp_buf[m] = 0.f;
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
}
}
sum[0] = (sum[0]) + (tmp_buf[m]);
Free(tmp_buf);
}
}
```
The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what its intending to (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work it correctly fails. I've disabled it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050
Reviewed By: ZolotukhinM
Differential Revision: D22605639
Pulled By: nickgg
fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0