Improve make_tensor performance for float and complex types (#85473)

For floating-point types, `make_tensor` calls `rand` and then linearly
interpolates the result from `low` to `high`. This PR instead calls
`uniform_(low, high)`, which cuts out the interpolation step.
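
A minimal sketch of the two floating-point strategies described above. The
helper names `rand_then_interpolate` and `uniform_in_place` are hypothetical
and are not taken from the `make_tensor` source; only `torch.rand`,
`torch.empty`, and `Tensor.uniform_` are real PyTorch calls.

```python
import torch


def rand_then_interpolate(shape, low, high, dtype=torch.float32):
    # Old strategy (illustrative): sample in [0, 1), then rescale into
    # the [low, high) range with extra elementwise operations.
    t = torch.rand(shape, dtype=dtype)
    return t * (high - low) + low


def uniform_in_place(shape, low, high, dtype=torch.float32):
    # New strategy (illustrative): allocate once and sample directly
    # from the target range, with no separate rescaling pass.
    return torch.empty(shape, dtype=dtype).uniform_(low, high)


a = rand_then_interpolate((4096,), -9.0, 9.0)
b = uniform_in_place((4096,), -9.0, 9.0)
```

Both helpers return a tensor of the requested shape with values drawn from
the same range; the second avoids the intermediate multiply and add.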

For complex types, `make_tensor` performs the `rand` + interpolation step
twice, once for the real part and once for the imaginary part, and then
calls `torch.complex(real, imag)`. This PR instead uses `view_as_real` and
`uniform_(low, high)` to fill both parts in a single operation.
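
The complex case can be sketched the same way. The helper names
`complex_two_pass` and `complex_fused` are hypothetical, and the code only
illustrates the idea, under the assumption that the result is allocated
first and then filled through its real-valued view.

```python
import torch


def complex_two_pass(shape, low, high):
    # Old strategy (illustrative): build real and imaginary parts
    # separately, then combine them into a complex64 tensor.
    real = torch.rand(shape, dtype=torch.float32) * (high - low) + low
    imag = torch.rand(shape, dtype=torch.float32) * (high - low) + low
    return torch.complex(real, imag)


def complex_fused(shape, low, high):
    # New strategy (illustrative): allocate the complex tensor, then fill
    # it through a real view. view_as_real returns a view of shape
    # (*shape, 2) that shares storage with `result`, so one uniform_
    # call writes both the real and imaginary components.
    result = torch.empty(shape, dtype=torch.complex64)
    torch.view_as_real(result).uniform_(low, high)
    return result


z_old = complex_two_pass((8,), -1.0, 1.0)
z_new = complex_fused((8,), -1.0, 1.0)
```

Because the view shares memory with the complex tensor, no intermediate
real or imaginary tensors are allocated in the fused version.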

My benchmarks show speedups in every case measured for float32 and
complex64, ranging from 1.7x to 11.3x.

| Device | dtype | Size | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU | float32 | 8 | 19.4 | 6.34 | 3.1 |
| | | 4096 | 36.8 | 21.3 | 1.7 |
| | | 2**24 | 167,000 | 80,500 | 2.1 |
| | complex64 | 8 | 37.0 | 7.57 | 4.9 |
| | | 4096 | 73.1 | 37.6 | 1.9 |
| | | 2**24 | 409,000 | 161,000 | 2.5 |
| CUDA | float32 | 8 | 40.4 | 11.7 | 3.5 |
| | | 4096 | 38.7 | 11.7 | 3.3 |
| | | 2**24 | 2,300 | 238 | 9.7 |
| | complex64 | 8 | 78.7 | 14 | 5.6 |
| | | 4096 | 82.7 | 13.8 | 6.0 |
| | | 2**24 | 5,520 | 489 | 11.3 |
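
A comparison of this kind could be timed along the following lines. This is
only a sketch using `torch.utils.benchmark.Timer`; it is not the script
that produced the numbers above, and the helper `time_us` and the chosen
statements are assumptions for illustration.

```python
import torch
from torch.utils import benchmark


def time_us(stmt, size, number=100):
    # Hypothetical helper: time `stmt` and return the median in microseconds.
    timer = benchmark.Timer(stmt=stmt, globals={"torch": torch, "size": size})
    return timer.timeit(number).median * 1e6


old_stmt = "torch.rand(size) * 18.0 - 9.0"        # rand + rescale
new_stmt = "torch.empty(size).uniform_(-9.0, 9.0)"  # direct sampling

results = {}
for size in (8, 4096):
    results[size] = (time_us(old_stmt, size), time_us(new_stmt, size))
    print(f"size={size}: old={results[size][0]:.2f} us, new={results[size][1]:.2f} us")
```

Absolute timings depend on the machine and PyTorch build, so only the
relative difference between the two statements is meaningful.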
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473
Approved by: https://github.com/mruberry