Add col2im_batched kernel (#84543)
Closes #84407
This changes col2im on CUDA to launch a single batch-aware kernel
instead of one kernel per slice (n separate launches).
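For reference, below is a minimal sketch of the batched approach: one
grid-stride kernel whose flat index encodes (batch, channel, y, x), so a
single launch covers every slice. The names (`col2im_batched`,
`launch_col2im_batched`) and the simplified parameter set (dilation fixed
at 1) are illustrative assumptions, not the actual code in this PR.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Simplified batched col2im sketch (dilation fixed at 1).
// data_col: (n_batch, channels*kernel_h*kernel_w, height_col*width_col)
// data_im:  (n_batch, channels, height, width)
// The flat index encodes (batch, channel, y, x), so one launch covers
// all batch slices instead of launching n per-slice kernels.
template <typename T>
__global__ void col2im_batched(
    int64_t total,  // n_batch * channels * height * width
    const T* data_col, T* data_im,
    int64_t channels, int64_t height, int64_t width,
    int64_t kernel_h, int64_t kernel_w,
    int64_t pad_h, int64_t pad_w,
    int64_t stride_h, int64_t stride_w,
    int64_t height_col, int64_t width_col) {
  const int64_t col_batch_stride =
      channels * kernel_h * kernel_w * height_col * width_col;
  for (int64_t index = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
       index < total; index += (int64_t)blockDim.x * gridDim.x) {
    // Decode (batch, channel, y, x) from the flat index.
    const int64_t w_im = index % width + pad_w;
    const int64_t h_im = (index / width) % height + pad_h;
    const int64_t c_im = (index / (width * height)) % channels;
    const int64_t b = index / (width * height * channels);
    const T* col = data_col + b * col_batch_stride;

    // Range of sliding-window positions whose kernel covers (h_im, w_im).
    const int64_t w_col_start =
        (w_im < kernel_w) ? 0 : (w_im - kernel_w) / stride_w + 1;
    const int64_t w_lim = w_im / stride_w + 1;
    const int64_t w_col_end = w_lim < width_col ? w_lim : width_col;
    const int64_t h_col_start =
        (h_im < kernel_h) ? 0 : (h_im - kernel_h) / stride_h + 1;
    const int64_t h_lim = h_im / stride_h + 1;
    const int64_t h_col_end = h_lim < height_col ? h_lim : height_col;

    // Sum the contributions of all overlapping patches.
    T val = static_cast<T>(0);
    for (int64_t h_col = h_col_start; h_col < h_col_end; ++h_col) {
      for (int64_t w_col = w_col_start; w_col < w_col_end; ++w_col) {
        const int64_t h_k = h_im - h_col * stride_h;
        const int64_t w_k = w_im - w_col * stride_w;
        const int64_t col_index =
            (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + h_col) *
                width_col + w_col;
        val += col[col_index];
      }
    }
    data_im[index] = val;
  }
}

// Host side: a single launch over all batches replaces the per-slice loop.
template <typename T>
void launch_col2im_batched(
    const T* data_col, T* data_im, int64_t n_batch,
    int64_t channels, int64_t height, int64_t width,
    int64_t kernel_h, int64_t kernel_w,
    int64_t pad_h, int64_t pad_w,
    int64_t stride_h, int64_t stride_w,
    cudaStream_t stream) {
  const int64_t height_col = (height + 2 * pad_h - kernel_h) / stride_h + 1;
  const int64_t width_col = (width + 2 * pad_w - kernel_w) / stride_w + 1;
  const int64_t total = n_batch * channels * height * width;
  const int threads = 256;
  const int blocks = static_cast<int>((total + threads - 1) / threads);
  col2im_batched<T><<<blocks, threads, 0, stream>>>(
      total, data_col, data_im, channels, height, width,
      kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,
      height_col, width_col);
}
```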
With this change, the `istft` call in the linked issue goes from 98.7 ms
to 858 µs on my machine, a speedup of more than 100x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84543
Approved by: https://github.com/ngimel