add eigen blas for mobile build (#26508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26508
Enable BLAS for pytorch mobile build using Eigen BLAS.
It's not most juicy optimization for typical mobile CV models as we are already
using NNPACK/QNNPACK for most ops there. But it's nice to have good fallback
implementation for other ops.
Test Plan:
- Create a simple matrix multiplication script model:
```
import torch
class Net(torch.nn.Module):
def __init__(self):
super(Net, self).__init__()
self.weights = torch.ones(1000, 1000)
def forward(self, x):
return torch.mm(x, self.weights)
n = Net()
module = torch.jit.trace_module(n, {'forward': torch.ones(1000, 1000)})
module.save('mm.pk')
```
- Before integrate with eigen blas:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'
Milliseconds per iter: 2218.52.
```
- After integrate with eigen blas:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'
Milliseconds per iter: 314.535.
```
- Improve MobileNetV2 single thread perf by ~5%:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'
Milliseconds per iter: 367.055.
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'
Milliseconds per iter: 348.77.
```
Differential Revision: D17489587
fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e