reimplement __torch_function__ overrides for torch.functional using inline logic (#32194)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30831.
This improves the performance of operators in the `torch.functional` namespace that are overridable by `__torch_function__` implementations when supplied with `Tensor` operands.
Running the split benchmark in various configurations produces the following timings:
<details>
<summary>Expand for timings on <code>master</code> </summary>
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 3.340
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 3.333
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.366
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.385
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.468
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.416
```
</details>
<details>
<summary>Expand for timings with this pull request applied</summary>
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.261
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.223
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.237
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.218
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.259
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.234
```
</details>
<details>
<summary>Expand for timings on <code>master</code> with <code>__torch_function__</code> dispatch disabled </summary>
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.180
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.172
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.171
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.146
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.175
# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.152
```
</details>
So at least on the machine I'm testing on, this brings the overhead down to less than 100 ns. For comparison, the overhead for `__array_function__` in NumPy is about 850 ns on the same machine.
<details>
<summary>Expand for timings for NumPy <code>__array_function__</code> dispatch </summary>
```
In [1]: import numpy as np
In [2]: %timeit np.mean([1])
8.89 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit np.mean._implementation([1])
8.04 µs ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
See [the implementation in NumPy](https://github.com/numpy/numpy/blob/master/numpy/core/overrides.py#L195) for why this measures `__array_function__` overhead.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32194
Differential Revision: D19410396
Pulled By: ezyang
fbshipit-source-id: ada788a4399c81cd7eb2d548aa04a2459e96634a