Add col2im_batched kernel (#84543)
Closes #84407
This changes col2im on CUDA to launch a single batch-aware kernel
instead of one kernel per slice (n separate launches).
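For reference, below is a minimal sketch of the batched approach: one
grid-stride kernel whose flat index encodes (batch, channel, y, x), so a
single launch covers every slice. The names (`col2im_batched`,
`launch_col2im_batched`) and the simplified parameter set (dilation fixed
at 1) are illustrative assumptions, not the actual code in this PR.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Simplified batched col2im sketch (dilation fixed at 1).
// data_col: (n_batch, channels*kernel_h*kernel_w, height_col*width_col)
// data_im:  (n_batch, channels, height, width)
// The flat index encodes (batch, channel, y, x), so one launch covers
// all batch slices instead of launching n per-slice kernels.
template <typename T>
__global__ void col2im_batched(
    int64_t total,  // n_batch * channels * height * width
    const T* data_col, T* data_im,
    int64_t channels, int64_t height, int64_t width,
    int64_t kernel_h, int64_t kernel_w,
    int64_t pad_h, int64_t pad_w,
    int64_t stride_h, int64_t stride_w,
    int64_t height_col, int64_t width_col) {
  const int64_t col_batch_stride =
      channels * kernel_h * kernel_w * height_col * width_col;
  for (int64_t index = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
       index < total; index += (int64_t)blockDim.x * gridDim.x) {
    // Decode (batch, channel, y, x) from the flat index.
    const int64_t w_im = index % width + pad_w;
    const int64_t h_im = (index / width) % height + pad_h;
    const int64_t c_im = (index / (width * height)) % channels;
    const int64_t b = index / (width * height * channels);
    const T* col = data_col + b * col_batch_stride;

    // Range of sliding-window positions whose kernel covers (h_im, w_im).
    const int64_t w_col_start =
        (w_im < kernel_w) ? 0 : (w_im - kernel_w) / stride_w + 1;
    const int64_t w_lim = w_im / stride_w + 1;
    const int64_t w_col_end = w_lim < width_col ? w_lim : width_col;
    const int64_t h_col_start =
        (h_im < kernel_h) ? 0 : (h_im - kernel_h) / stride_h + 1;
    const int64_t h_lim = h_im / stride_h + 1;
    const int64_t h_col_end = h_lim < height_col ? h_lim : height_col;

    // Sum the contributions of all overlapping patches.
    T val = static_cast<T>(0);
    for (int64_t h_col = h_col_start; h_col < h_col_end; ++h_col) {
      for (int64_t w_col = w_col_start; w_col < w_col_end; ++w_col) {
        const int64_t h_k = h_im - h_col * stride_h;
        const int64_t w_k = w_im - w_col * stride_w;
        const int64_t col_index =
            (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + h_col) *
                width_col + w_col;
        val += col[col_index];
      }
    }
    data_im[index] = val;
  }
}

// Host side: a single launch over all batches replaces the per-slice loop.
template <typename T>
void launch_col2im_batched(
    const T* data_col, T* data_im, int64_t n_batch,
    int64_t channels, int64_t height, int64_t width,
    int64_t kernel_h, int64_t kernel_w,
    int64_t pad_h, int64_t pad_w,
    int64_t stride_h, int64_t stride_w,
    cudaStream_t stream) {
  const int64_t height_col = (height + 2 * pad_h - kernel_h) / stride_h + 1;
  const int64_t width_col = (width + 2 * pad_w - kernel_w) / stride_w + 1;
  const int64_t total = n_batch * channels * height * width;
  const int threads = 256;
  const int blocks = static_cast<int>((total + threads - 1) / threads);
  col2im_batched<T><<<blocks, threads, 0, stream>>>(
      total, data_col, data_im, channels, height, width,
      kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,
      height_col, width_col);
}
```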
With this change, the `istft` call in the linked issue goes from 98.7 ms
to 858 µs on my machine, a speedup of more than 100x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84543
Approved by: https://github.com/ngimel