pytorch
21ef4cc6 - Improve bmm performance on CPU by applying TensorAccessor (#20266)

5 years ago

Improve bmm performance on CPU by applying TensorAccessor (#20266)

Summary:
Currently `bmm()` has very heavy performance overhead on CPU due to the construction and destruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.

I tested this on the `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores, 2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Considering that `bmm()` takes only **14%** of the total time, a 10% overall improvement indicates `bmm()` itself improves by roughly **3x**.

Before:
```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```

After:
```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20266

Differential Revision: D15262201

Pulled By: cpuhrsch

fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
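The per-kernel estimate in the summary can be sanity-checked with an Amdahl-style calculation. The sketch below rests on two assumptions that the commit message does not state: that the 14% share refers to the iteration time before the change, and that the entire saving comes from `bmm()`. The function name `implied_kernel_speedup` is hypothetical and used only for illustration.

```python
def implied_kernel_speedup(kernel_fraction, time_before, time_after):
    """Speedup of one kernel implied by an end-to-end timing change.

    Assumes (not stated in the commit) that `kernel_fraction` is the
    kernel's share of `time_before` and that the whole saving is due
    to that kernel alone.
    """
    kernel_before = kernel_fraction * time_before
    saved = time_before - time_after
    kernel_after = kernel_before - saved
    if kernel_after <= 0:
        raise ValueError("saving exceeds the kernel's share of the runtime")
    return kernel_before / kernel_after


# Measured seconds per iteration from the logs: 3.60 before, 3.21 after.
print(f"{implied_kernel_speedup(0.14, 3.60, 3.21):.2f}")  # prints 4.42

# Using the rounded 10% overall improvement instead.
print(f"{implied_kernel_speedup(0.14, 10.0, 9.0):.2f}")  # prints 3.50
```

Under these assumptions the implied `bmm()` speedup falls between about 3.5x and 4.4x, so the "roughly 3x" quoted in the summary reads as a conservative estimate. The result is sensitive to the 14% figure, because the saving is close to the kernel's whole share of the runtime.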

Author

mingfeima

Committer

facebook-github-bot

Parents

fa189641
