[cuda] Limit grid size for torch.cat kernel on aligned16 contig tensors (#103233)
When torch.cat is called on a list of contiguous tensors that are aligned on a 16B boundary in memory, the number of thread blocks used is directly proportional to the maximum size of the tensors in the list. If one or more tensors are very large while the others are small, the large number of thread blocks results in useless, redundant loads of the input metadata. This PR limits the grid size and improves the performance of cat when used on lists of tensors with large variations in size.
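As a rough illustration of the general idea (this is not the actual PyTorch cat kernel, and the kernel name, copy pattern, and "8 blocks per SM" cap below are arbitrary choices for the sketch, not the heuristic adopted in this PR), a capped grid combined with a grid-stride loop lets a fixed, modest number of blocks cover an input of any size instead of launching a grid proportional to the largest tensor:

```cuda
// Minimal sketch (not the actual PyTorch cat kernel): a grid-stride loop lets a
// capped number of blocks cover an input of any size, instead of launching a
// grid proportional to the largest tensor.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

__global__ void copy_grid_stride(const float* __restrict__ src,
                                 float* __restrict__ dst,
                                 long n) {
  // Each thread strides over the input, so the same (small) grid works for
  // both small and very large tensors.
  long stride = (long)gridDim.x * blockDim.x;
  for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    dst[i] = src[i];
  }
}

int main() {
  const long n = 1L << 24;
  float *src = nullptr, *dst = nullptr;
  cudaMalloc(&src, n * sizeof(float));
  cudaMalloc(&dst, n * sizeof(float));
  cudaMemset(src, 0, n * sizeof(float));

  const int threads = 256;
  int dev = 0, sm_count = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);

  // Cap the grid instead of launching one block per 256-element chunk.
  // The "8 blocks per SM" cap is an arbitrary illustrative choice,
  // not the heuristic used by the PR.
  long blocks_needed = (n + threads - 1) / threads;
  int grid = (int)std::min<long>(blocks_needed, (long)sm_count * 8);

  copy_grid_stride<<<grid, threads>>>(src, dst, n);
  cudaDeviceSynchronize();
  printf("launched %d blocks for %ld elements\n", grid, n);

  cudaFree(src);
  cudaFree(dst);
  return 0;
}
```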
Used the same test program from https://github.com/pytorch/pytorch/pull/102815 but added new cases with lists of tensors of varying sizes.
<img width="735" alt="Screenshot 2023-06-07 at 10 14 18 PM" src="https://github.com/pytorch/pytorch/assets/23515689/72d0e5cb-5840-400e-b53b-d1418e664f19">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103233
Approved by: https://github.com/malfet