[ROCm] CatArrayBatchedCopy performance improvement (#118685)
Tune the grid and block sizes for ROCm. Add a contiguous-only kernel, separate from the aligned+contiguous kernel.
Verified new performance using pytorch/benchmarks/operator_benchmark.
`python -m pt.cat_test --device=cuda --tag-filter all`
On average, this improved performance by 4% on MI200 and by 14% on MI300.
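
To illustrate the kernel split described above, here is a minimal Python sketch of how a dispatcher might choose between a generic path, a contiguous-only path, and an aligned+contiguous path. This is not the code from the PR: `InputDesc`, `choose_copy_path`, the path names, and the 16-byte alignment threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed alignment requirement for vectorized loads/stores (hypothetical).
ALIGN_BYTES = 16


@dataclass
class InputDesc:
    """Hypothetical description of one input tensor to a cat operation."""
    data_ptr: int        # address of the first element
    nbytes: int          # total size in bytes
    is_contiguous: bool  # True if the tensor is densely laid out


def choose_copy_path(inputs):
    """Return the most specialized copy path that every input qualifies for."""
    all_contig = all(t.is_contiguous for t in inputs)
    all_aligned = all(
        t.data_ptr % ALIGN_BYTES == 0 and t.nbytes % ALIGN_BYTES == 0
        for t in inputs
    )
    if all_contig and all_aligned:
        return "aligned_contig"  # vectorized copy is safe
    if all_contig:
        return "contig"          # simple indexing, but scalar loads/stores
    return "generic"             # fall back to full stride arithmetic
```

The point of a separate contiguous-only path is that inputs which are contiguous but not aligned no longer have to fall back to the generic kernel.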
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet