BERT CPU performance optimization: use mkldnn for nn.Linear() when input is dense layout (#21851)
Summary:
This PR improves BERT performance on CPU by using the `mkldnn` inner product for `nn.Linear()`.
Currently, `mkldnn` is used only when the `input` tensor has mkldnn layout. This PR loosens that condition: `mkldnn` is also used for `nn.Linear()` when the `input` tensor has dense layout. The ATen tensor is viewed in place as an `mkldnn` tensor, without an additional memory copy. Two adjustments are made to the input:
1. when `input.dim() >= 3`, it is viewed as a 2-D tensor, e.g. `[T, N, C]` is treated as `[T*N, C]`;
2. when `input` is not contiguous, it is first copied to contiguous memory, since the `mkldnn` inner product cannot handle non-contiguous memory (see the sketch after this list).
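A minimal sketch of the two input cases described above. The shapes and sizes are hypothetical, chosen only to illustrate the dense-layout fast path this PR describes:
```python
import torch
import torch.nn as nn

linear = nn.Linear(1024, 1024)

# Case 1: a dense 3-D input [T, N, C]. Per this PR, it is viewed internally
# as [T*N, C] (here [256, 1024]) for the mkldnn inner product, with no copy.
x = torch.randn(32, 8, 1024)
y = linear(x)
assert y.shape == (32, 8, 1024)

# Case 2: a non-contiguous input (e.g. after transpose) is first copied to
# contiguous memory, since the mkldnn inner product requires it.
x_t = torch.randn(8, 32, 1024).transpose(0, 1)   # non-contiguous [32, 8, 1024]
assert not x_t.is_contiguous()
y_t = linear(x_t)
assert y_t.shape == (32, 8, 1024)
```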
With this PR, BERT inference on `glue/MRPC` (batch size = 1) on a single Xeon 6148 socket (20 cores @ 2.5 GHz) improves throughput by `44%`:
1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```
Correspondingly, per-iteration latency drops from `59.92 ms` to `41.56 ms` (latency in ms = 1000 / throughput in it/s).
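For reference, a micro-benchmark sketch along these lines. The layer and input sizes are assumptions (typical of BERT-base), not the actual `glue/MRPC` run, and `torch.set_num_threads` pins the thread count to one socket as above:
```python
import time
import torch
import torch.nn as nn

torch.set_num_threads(20)        # single socket of the Xeon 6148 above

linear = nn.Linear(768, 768)     # BERT-base hidden size (assumption)
x = torch.randn(128, 1, 768)     # dense [T, N, C] input, batch size = 1

with torch.no_grad():
    for _ in range(10):          # warm-up
        linear(x)
    start = time.time()
    n_iter = 100
    for _ in range(n_iter):
        linear(x)
    avg_s = (time.time() - start) / n_iter

print(f"avg latency per call: {avg_s * 1e3:.3f} ms "
      f"({1.0 / avg_s:.2f} it/s)")
```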
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21851
Differential Revision: D16056334
Pulled By: dzhulgakov
fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a