[NNC] Registerizer for GPU [1/x] (#42606)
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.
For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicates this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code so that will come in a follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180