istft: Use unfold_backward instead of col2im (#88060)
`unfold_backward` implements the same operation as `col2im`, but without support
for 2d kernels or dilation. However, `istft` doesn't use either of those features,
and `unfold_backward` has a faster `TensorIterator`-based implementation, so we
should use it here instead.
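The equivalence being relied on is that, for 1d frames with no dilation, the overlap-add performed by `unfold_backward` matches `col2im` (exposed in Python as `torch.nn.functional.fold` with a `(k, 1)` kernel). A minimal sketch, not part of the PR, that checks this on a small example:

```python
import torch
import torch.nn.functional as F

L, size, step = 16, 4, 2          # signal length, frame size, hop
n_frames = (L - size) // step + 1  # 7 frames, as produced by Tensor.unfold
g = torch.randn(n_frames, size)

# Overlap-add via unfold_backward: scatter-adds each frame back
# into a length-L signal at offset i * step.
out_unfold = torch.ops.aten.unfold_backward(g, [L], 0, size, step)

# The same overlap-add via col2im/fold, using a (size, 1) kernel to
# emulate the 1d case; fold expects frames as (N, C*k, n_frames).
out_fold = F.fold(
    g.t().unsqueeze(0),            # (1, size, n_frames)
    output_size=(L, 1),
    kernel_size=(size, 1),
    stride=(step, 1),
).reshape(L)

print(torch.allclose(out_unfold, out_fold))
```

Since `istft` only ever needs this 1d, undilated overlap-add, the two paths are interchangeable there.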
In the example from #87353 I see a 2x speedup on both CPU and CUDA.
On a wider variety of sizes and inputs I still see speedups across the board, especially
on CPU since `col2im` isn't parallelized but `unfold_backward` is:
| device | shape | hop_length | Master (us) | This PR (us) | Speedup |
|--------|-----------------|------------|-------------|--------------|---------|
| CUDA | (1, 129, 33) | 256 | 147 | 136 | 1.08 |
| | | 128 | 153 | 128 | 1.20 |
| | (100, 129, 20) | 256 | 181 | 147 | 1.23 |
| | | 128 | 171 | 137 | 1.25 |
| | (1000, 129, 10) | 256 | 681 | 443 | 1.55 |
| | | 128 | 632 | 446 | 1.42 |
| CPU | (1, 129, 33) | 256 | 106 | 104 | 1.02 |
| | | 128 | 103 | 81 | 1.27 |
| | (100, 129, 20) | 256 | 2400 | 399 | 6.02 |
| | | 128 | 2150 | 313 | 6.87 |
| | (1000, 129, 10) | 256 | 13800 | 3740 | 3.69 |
| | | 128 | 12700 | 2110 | 6.02 |
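The exact benchmark script isn't included in the PR; a hedged sketch of timing one of the table's configurations (shapes and `hop_length` come from the table, and `n_fft=256` is an assumption implied by the 129 onesided frequency bins):

```python
import torch
from torch.utils import benchmark

# One row from the table: a batch of 100 spectrograms with
# 129 frequency bins and 20 frames, hop_length=128.
spec = torch.randn(100, 129, 20, dtype=torch.complex64)

timer = benchmark.Timer(
    stmt="torch.istft(spec, n_fft=256, hop_length=128)",
    globals={"spec": spec},
)
print(timer.timeit(100))
```

Varying the batch size, frame count, and `hop_length` as in the table (and moving `spec` to CUDA) should reproduce the comparison above.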
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88060
Approved by: https://github.com/albanD