[te] Create TargetMachine only once with correct options to fix perf (#50406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50406
We were creating two different TargetMachines, one in PytorchLLVMJIT and one in
LLVMCodeGen. The TM in LLVMCodeGen had the right target-specific options to
generate fast AVX2 code (with FMAs, vbroadcastss, etc.), and that is what showed
up in the debug output; but the TM in PytorchLLVMJIT was the one that actually
generated the runtime code, and it was slow.
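For context, here is a minimal sketch (not the code from this diff) of building a
single host-aware llvm::TargetMachine once, with the host CPU name and subtarget
features (avx2, fma, ...) enabled, so that the same TM can be shared by the JIT and
the code generator. `makeHostTargetMachine` is a hypothetical helper, and the
includes assume an LLVM version of roughly this era (pre-14, where TargetRegistry.h
still lives under llvm/Support):
```
// Sketch only: one way to construct a single TargetMachine with host features.
#include <memory>
#include <string>

#include "llvm/ADT/StringMap.h"
#include "llvm/MC/SubtargetFeature.h"
#include "llvm/Support/Host.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"

std::unique_ptr<llvm::TargetMachine> makeHostTargetMachine() {
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();

  std::string triple = llvm::sys::getProcessTriple();
  std::string err;
  const llvm::Target* target = llvm::TargetRegistry::lookupTarget(triple, err);
  if (!target) {
    return nullptr; // lookup failed; `err` holds the reason
  }

  // Collect the host's subtarget features (avx2, fma, ...) so the generated
  // code can use vbroadcastss/FMA instead of generic scalar/SSE code.
  llvm::SubtargetFeatures features;
  llvm::StringMap<bool> hostFeatures;
  if (llvm::sys::getHostCPUFeatures(hostFeatures)) {
    for (auto& f : hostFeatures) {
      features.AddFeature(f.first(), f.second);
    }
  }

  llvm::TargetOptions options;
  return std::unique_ptr<llvm::TargetMachine>(target->createTargetMachine(
      triple,
      llvm::sys::getHostCPUName(),
      features.getString(),
      options,
      llvm::Reloc::Model::PIC_,
      llvm::None,
      llvm::CodeGenOpt::Aggressive));
}
```
The point of the diff is that both the JIT and the codegen must see the same TM
options; building the TM once (or from one shared set of options) avoids the
mismatch described above.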
ghstack-source-id: 119700110
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/fb/tensorexpr:tensorexpr_bench
```
With this diff, NNC comes within about 5% of PyTorch with MKL, at least for this
one small-ish test case:
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-11 15:57:27
----------------------------------------------------------------------------------------------------
Benchmark                                          Time           CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                         65302 ns      65289 ns        10734 GFLOPS=64.2423G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128   68602 ns      68599 ns        10256 GFLOPS=61.1421G/s
```
Reviewed By: bwasti
Differential Revision: D25877605
fbshipit-source-id: cd293bac94d025511f348eab5c9b8b16bf6505ec