pytorch
b8fb6eae - Improve bmm() performance on CPU when input tensor is non-contiguous (#19338)

Commit

5 years ago

Improve bmm() performance on CPU when input tensor is non-contiguous (#19338)

Summary:
This PR aims to improve Transformer performance on CPU, where `bmm()` is currently one of the major bottlenecks.

The current CPU logic of `bmm()` uses MKL batch gemm only when the inputs `A` and `B` are contiguous or transposed. When `A` or `B` is a slice of a larger tensor, it falls back to a slower path.

`A` and `B` are both 3D tensors. MKL is able to handle the batch matrix multiplication whenever `A.stride(1) == 1 || A.stride(2) == 1` and `B.stride(1) == 1 || B.stride(2) == 1`.

In the [fairseq](https://github.com/pytorch/fairseq) implementation of Transformer, multi-head attention calls `bmm()` in two places, [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L167) and [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L197). `q`, `k`, and `v` are all slices of a larger tensor, so `bmm()` currently takes the slow path.

Results on Xeon 6148 (20*2 cores, 2.5GHz) indicate that this PR improves Transformer training performance by **48%** (seconds per iteration reduced from **5.48** to **3.70**). Inference performance should also improve.

Before:
```
| epoch 001: 0%| | 27/25337 [02:27<38:31:26, 5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,
```

After:
```
| epoch 001: 0%| | 97/25337 [05:58<25:55:49, 3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/19338

Differential Revision: D14986346

Pulled By: soumith

fbshipit-source-id: 827106245af908b8a4fda69ed0288d322b028f08
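The stride condition in the summary can be illustrated with a small sketch. It is written in plain Python so it runs without PyTorch, and it is not the code from this PR: the helper names (`contiguous_strides`, `is_contiguous`, `has_unit_inner_stride`) and the packed `(batch, seq, 3 * dim)` layout are assumptions made up for illustration, and the contiguity check is simplified (it ignores the special handling of size-1 dimensions). It shows why a slice of a larger tensor is not contiguous yet can still satisfy `stride(1) == 1 || stride(2) == 1`.

```python
def contiguous_strides(shape):
    """Row-major element strides for a densely packed tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the dense row-major strides."""
    return tuple(strides) == contiguous_strides(shape)


def has_unit_inner_stride(strides):
    """Condition quoted in the summary for one 3D operand:
    stride(1) == 1 or stride(2) == 1."""
    return strides[1] == 1 or strides[2] == 1


# Hypothetical packed tensor of shape (batch, seq, 3 * dim).
batch, seq, dim = 4, 5, 8
packed_shape = (batch, seq, 3 * dim)
packed_strides = contiguous_strides(packed_shape)

# Slicing the last dimension (like packed[:, :, :dim]) keeps the parent's
# strides and only narrows the shape.
q_shape = (batch, seq, dim)
q_strides = packed_strides

q_is_contiguous = is_contiguous(q_shape, q_strides)
q_fits_condition = has_unit_inner_stride(q_strides)

# A transposed view of the slice swaps the strides of dims 1 and 2.
qt_strides = (q_strides[0], q_strides[2], q_strides[1])
qt_fits_condition = has_unit_inner_stride(qt_strides)

print("packed strides:", packed_strides)
print("slice contiguous:", q_is_contiguous)
print("slice fits stride condition:", q_fits_condition)
print("transposed slice fits stride condition:", qt_fits_condition)
```

In this sketch the slice fails the contiguity check because its batch and row strides still reflect the wider parent tensor, while its innermost stride stays at 1. That is the case the summary describes as falling to the slow path even though the stride condition holds.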

Author

mingfeima

Committer

facebook-github-bot

Parents

12d6f79e
