Gather to Slice Fusion (#13599)
This PR optimizes the following code from Hugging Face's XLNet
model:
```
x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))
```
This code is exported as Range->Gather, which can be fused into a
single Slice op. The Slice kernel is much faster than Gather,
especially in the backward pass. The main reason is that Gather's
indices may contain duplicates, so its backward pass has to sum the
gradients of repeated indices, whereas Slice never selects the same
element twice.
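To illustrate the reasoning above, here is a minimal pure-Python sketch on a 1-D list. It is not the actual kernel or fusion code from this PR; the function names (`gather_forward`, `gather_backward`, `slice_forward`, `slice_backward`) are hypothetical and only model the semantics. It shows that gathering with `range(klen)` matches a slice, and that Gather's gradient needs accumulation while Slice's is a plain copy.

```python
def gather_forward(x, indices):
    """Gather on a 1-D list: out[i] = x[indices[i]]."""
    return [x[i] for i in indices]


def gather_backward(grad_out, indices, input_len):
    """Gradient of gather: a scatter-add, because indices may repeat."""
    grad_in = [0.0] * input_len
    for g, i in zip(grad_out, indices):
        grad_in[i] += g  # accumulation is required when an index repeats
    return grad_in


def slice_forward(x, start, end, step=1):
    """Slice on a 1-D list."""
    return x[start:end:step]


def slice_backward(grad_out, start, end, step, input_len):
    """Gradient of slice: every output maps to a distinct input, so copy."""
    grad_in = [0.0] * input_len
    grad_in[start:end:step] = grad_out  # no summation needed
    return grad_in


# Hypothetical example data, standing in for one axis of the real tensor.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
klen = 3
indices = list(range(klen))  # what the Range node produces

forward_matches = gather_forward(x, indices) == slice_forward(x, 0, klen)
grad = [1.0, 1.0, 1.0]
backward_matches = (
    gather_backward(grad, indices, len(x))
    == slice_backward(grad, 0, klen, 1, len(x))
)
```

With indices such as `[0, 0, 2]`, `gather_backward` has to add two gradients into position 0. That is the case a Slice can never hit, which is why its backward pass can skip the summation.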
Profiling results with Hugging Face's XLNet model:
- Before the fusion
  forward: ~753us
  backward: ~46101us
- After the fusion
  forward: ~627us
  backward: ~677us
