During inbatch broadcast, move Tile op after Fused8BitRowwiseQuantizedToFloat if applicable (#41464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464
If the input is int8 rowwise quantized, we currently cannot lower it to Glow, and we previously hit an error when running with in-batch broadcast. The root cause is that the Tile op does not support uint8_t, which would be easy to add here. However, that would leave a Tile -> Fused8BitRowwiseQuantizedToFloat sequence on the host side, which would likely hurt memory bandwidth a lot. Even if we later add Fused8BitRowwiseQuantizedToFloat support to Glow, that ordering is still not ideal, because we would be doing redundant compute on identical columns. The solution here is to swap the order of the two ops, turning Tile -> Fused8BitRowwiseQuantizedToFloat into Fused8BitRowwiseQuantizedToFloat -> Tile. This immediately resolves the error we saw; in the short term we can still run Tile on the card, and in the longer term, once the dequantization is also supported in Glow, everything runs faster on the card.
This optimization is a heuristic: if the net does not contain this pattern, in-batch broadcast works exactly as it did before. A minimal sketch of the rewrite follows.
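To make the rewrite concrete, here is a rough Python sketch of the pattern swap on a Caffe2 NetDef. The actual pass is implemented in C++ under caffe2/caffe2/opt/custom; the helper name `swap_tile_after_dequant` and the `_dequant` blob suffix below are made up for illustration, and this only covers the simple adjacent-op case.

```
# Hypothetical sketch only; the real pass lives in C++ in caffe2/opt/custom.
from caffe2.proto import caffe2_pb2


def swap_tile_after_dequant(net):
    """Rewrite Tile -> Fused8BitRowwiseQuantizedToFloat into
    Fused8BitRowwiseQuantizedToFloat -> Tile, so dequantization runs
    once on the untiled uint8 input and Tile operates on floats."""
    for i in range(len(net.op) - 1):
        if net.op[i].type != "Tile":
            continue
        if net.op[i + 1].type != "Fused8BitRowwiseQuantizedToFloat":
            continue
        if net.op[i + 1].input[0] != net.op[i].output[0]:
            continue

        # Work on copies so we can overwrite the ops in place afterwards.
        tile = caffe2_pb2.OperatorDef()
        tile.CopyFrom(net.op[i])
        dequant = caffe2_pb2.OperatorDef()
        dequant.CopyFrom(net.op[i + 1])

        quantized_in = tile.input[0]   # small uint8 rowwise-quantized blob
        final_out = dequant.output[0]  # float blob that downstream ops read
        intermediate = quantized_in + "_dequant"  # hypothetical blob name

        # Dequantize the small, untiled input first ...
        dequant.input[0] = quantized_in
        dequant.output[0] = intermediate
        # ... then tile the dequantized float rows to the full batch size.
        tile.input[0] = intermediate
        tile.output[0] = final_out

        # Swap execution order: dequant now runs before tile.
        net.op[i].CopyFrom(dequant)
        net.op[i + 1].CopyFrom(tile)
```

If the Tile -> Fused8BitRowwiseQuantizedToFloat pattern is absent, the loop never matches and the net is left untouched, which is the fallback behavior described above.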
Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```
Reviewed By: benjibc
Differential Revision: D22544162
fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff