Further parallelize linspace in addition to AVX (#38093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38093
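For readers unfamiliar with why linspace parallelizes well: each output element depends only on its index, so the output range can be split into independent chunks. The sketch below illustrates that idea in plain Python. It is not the code from this PR (the actual kernel is C++ and uses OpenMP); the names `linspace_chunk` and `chunked_linspace` and the `num_chunks` parameter are hypothetical, and Python threads are used only to show the chunking, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def linspace_chunk(start, step, begin, end):
    # Element i is start + i * step, so a chunk needs no data from other chunks.
    return [start + i * step for i in range(begin, end)]


def chunked_linspace(start, end, steps, num_chunks=4):
    # Hypothetical illustration only; not the PyTorch implementation.
    if steps <= 0:
        return []
    if steps == 1:
        return [float(start)]
    step = (end - start) / (steps - 1)
    # Split the index range [0, steps) into num_chunks contiguous pieces.
    bounds = [
        (k * steps // num_chunks, (k + 1) * steps // num_chunks)
        for k in range(num_chunks)
    ]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        # map() preserves input order, so the chunks come back in index order.
        parts = list(pool.map(lambda b: linspace_chunk(start, step, *b), bounds))
    return [x for part in parts for x in part]
```

Computing each value from its index, rather than by repeatedly adding the step to a running value, is what makes the chunks independent of each other.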
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136, Parallelization using OpenMP):
```
import timeit

# Time torch.linspace for each dtype at two sizes; the repeat count t is
# scaled so that every run produces the same total number of elements.
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                 (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup='import torch', number=t))
```
With AVX
========
Before:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0942596640015836
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.9209065200011537
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0520610109997506
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.9031864690005023
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.949299545998656
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.82629113800067
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.9547776939980395
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.8259895039991534
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.759497356000793
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
2.6285490109985403
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.3456633150017296
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.2031515989983745
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.559069258000818
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.378239962999942
```
After:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
0.8100852870011295
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.18943897200006177
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
0.6679975400002149
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.17846923400065862
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.1431112539976311
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
0.3336703610002587
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.157699686998967
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
0.32964968899977976
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.5379577429994242
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
0.4638638729993545
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.360489848000725
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
0.4033017760011717
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.4591587399991113
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
0.44132660000104806
```
Without AVX
===========
Before:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
3.4967273879992717
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
3.330881046000286
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
2.176502857997548
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
2.023505228000431
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
2.117801246000454
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.9885458380013006
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
2.1057261179994384
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.9809251260012388
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
3.187070896001387
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
3.049615387000813
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
3.4874590049985272
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
3.33596555099939
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
4.256659758000751
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
4.100936053000623
```
After:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.9155298300029244
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.598213522000151
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.3183841649988608
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.40136947100108955
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.2191377319977619
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
0.35984685299990815
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.2153874989999167
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
0.35752785600197967
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.750796647000243
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
0.5376063230032742
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.9153429929974664
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
0.5952553579991218
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.281823589000851
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
0.7391443560009066
```
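To make the before/after numbers easier to compare, the speedup can be computed as before divided by after. The sketch below does this for the `torch.double` rows only, with timings copied from the results above; the `timings` table and the `speedup` helper are illustrative names, not part of the PR.

```python
# (before, after) wall-clock seconds for torch.double, copied from the results above.
timings = {
    ("AVX", 40_000): (1.0942596640015836, 0.8100852870011295),
    ("AVX", 400_000): (0.9209065200011537, 0.18943897200006177),
    ("no AVX", 40_000): (3.4967273879992717, 1.9155298300029244),
    ("no AVX", 400_000): (3.330881046000286, 0.598213522000151),
}


def speedup(before, after):
    # Ratio > 1 means the "after" build is faster.
    return before / after


speedups = {key: speedup(*pair) for key, pair in timings.items()}

for (build, n), ratio in speedups.items():
    print(f"{build:>6} n={n}: {ratio:.2f}x")
```

In both builds the larger tensor (400,000 elements) gains more than the smaller one (40,000 elements).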
Differential Revision: D21528099
Test Plan: Imported from OSS
Pulled By: malfet
fbshipit-source-id: a6b3904e7860bb6d652a48b2056154509e73157d