[te] Optimize allocation of kernel outputs (#50318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318
We can skip the dispatcher and go directly to the device-specific
`at::native::empty_strided` implementation.
Also, unpacking the TensorOptions struct at kernel launch time takes a
bit of work, since the optional fields are encoded in a bitfield. Instead, do
the unpacking once up front and use the resulting optionals directly at runtime.
ghstack-source-id: 119735738
Test Plan:
Before:
```
-------------------------------------------------------
Benchmark            Time           CPU     Iterations
-------------------------------------------------------
FusedOverhead     2143 ns       2142 ns        332946
UnfusedOverhead   2277 ns       2276 ns        315130
```
After:
```
-------------------------------------------------------
Benchmark            Time           CPU     Iterations
-------------------------------------------------------
FusedOverhead     2175 ns       2173 ns        321877
UnfusedOverhead   2394 ns       2394 ns        307360
```
(The noise in the baseline makes this really hard to read; in my local
testing it seemed to be about 3-5% faster.)
Reviewed By: eellison
Differential Revision: D25859132
fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509