Improve bmm performance on CPU by applying TensorAccessor (#20266)
Summary:
Currently `bmm()` has very high overhead on CPU due to the construction/destruction of `TensorImpl` objects. Using `TensorAccessor` to index tensor data greatly reduces this overhead.
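To illustrate why accessor-style indexing is cheap, here is a minimal, self-contained sketch. It does not use ATen: `Accessor`, `Tensor3` and `bmm_naive` are hypothetical stand-ins written for this example, not the code changed in this PR. In ATen the comparable view is obtained with `tensor.accessor<T, N>()`. The point is that indexing only produces a small non-owning view (pointer, sizes, strides) on the stack, with no allocation or reference counting.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an accessor: a non-owning N-d view.
// Indexing returns an (N-1)-d view by value; nothing is heap-allocated.
template <typename T, int N>
struct Accessor {
  T* data;
  const int64_t* sizes;
  const int64_t* strides;
  Accessor<T, N - 1> operator[](int64_t i) const {
    return {data + i * strides[0], sizes + 1, strides + 1};
  }
};

// 1-d case: indexing yields a reference to the element itself.
template <typename T>
struct Accessor<T, 1> {
  T* data;
  const int64_t* sizes;
  const int64_t* strides;
  T& operator[](int64_t i) const { return data[i * strides[0]]; }
};

// Hypothetical owning container for a contiguous 3-d float tensor.
struct Tensor3 {
  std::vector<float> data;
  int64_t sizes[3];
  int64_t strides[3];
  Tensor3(int64_t b, int64_t m, int64_t n)
      : data(b * m * n, 0.0f), sizes{b, m, n}, strides{m * n, n, 1} {}
  Accessor<float, 3> view() { return {data.data(), sizes, strides}; }
  Accessor<const float, 3> const_view() const {
    return {data.data(), sizes, strides};
  }
};

// Naive batched matmul: out[b] = lhs[b] * rhs[b],
// with lhs of shape (B, M, K), rhs (B, K, N) and out (B, M, N).
void bmm_naive(Accessor<const float, 3> lhs, Accessor<const float, 3> rhs,
               Accessor<float, 3> out) {
  assert(lhs.sizes[0] == rhs.sizes[0] && lhs.sizes[2] == rhs.sizes[1]);
  const int64_t B = lhs.sizes[0], M = lhs.sizes[1];
  const int64_t K = lhs.sizes[2], N = rhs.sizes[2];
  for (int64_t b = 0; b < B; ++b) {
    // Per-batch slices are plain views, not new tensor objects.
    auto l = lhs[b];
    auto r = rhs[b];
    auto o = out[b];
    for (int64_t i = 0; i < M; ++i) {
      for (int64_t j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int64_t k = 0; k < K; ++k) acc += l[i][k] * r[k][j];
        o[i][j] = acc;
      }
    }
  }
}
```

A production kernel would hand each per-batch slice to an optimized matrix-multiply routine rather than the triple loop used here; the loop only keeps the sketch self-contained.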
I tested this on the `fairseq` Transformer model. Results on a Xeon 6148 (20*2 cores, 2.5GHz) show that this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Since `bmm()` accounts for only **14%** of the total time, a 10% overall improvement implies that `bmm()` itself became roughly **3x** faster.
Before:
```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```
After:
```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
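The per-operator estimate can be reproduced with Amdahl-style arithmetic on the timings above. The sketch below rests on two assumptions that the measurements do not confirm: that the 14% share of `bmm()` was measured before the change, and that all of the saved time comes from `bmm()`. The helper `component_speedup` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>

// Speedup of a single component inferred from end-to-end timings.
// Assumes `fraction` is the component's share of `before`, and that the
// whole difference (before - after) was saved inside that component.
double component_speedup(double fraction, double before, double after) {
  const double component_before = fraction * before;
  const double saved = before - after;
  // The component cannot save more time than it originally took.
  assert(saved < component_before);
  return component_before / (component_before - saved);
}
```

With the rounded 10% figure, `component_speedup(0.14, 1.0, 0.90)` gives about 3.5; with the raw timings, `component_speedup(0.14, 3.60, 3.21)` gives about 4.4. Under these assumptions the 3x figure looks, if anything, conservative.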
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35