CUDA fast path for `_chunk_cat()` (#120678)
This PR adds a CUDA fast-path implementation for the ATen op `_chunk_cat` (#121081).
Performance on a production benchmark (baseline -> this PR):
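For readers unfamiliar with the op, here is a rough pure-Python model of the chunk-then-concatenate behavior that `_chunk_cat` fuses. It is only a sketch under stated assumptions: inputs are 1-D lists, chunking is along dim 0, and any padding is zero-filled. The name `chunk_cat_reference` is hypothetical and is not part of PyTorch.

```python
from typing import List, Sequence


def chunk_cat_reference(
    tensors: Sequence[Sequence[float]], num_chunks: int
) -> List[List[float]]:
    """Hypothetical model of chunk-then-cat for 1-D inputs along dim 0.

    Each input is padded (with zeros, an assumption of this sketch) to a
    multiple of num_chunks, split into num_chunks equal pieces, and piece i
    of every input is appended to output row i.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    rows: List[List[float]] = [[] for _ in range(num_chunks)]
    for t in tensors:
        chunk_len = -(-len(t) // num_chunks)  # ceiling division
        padded = list(t) + [0] * (chunk_len * num_chunks - len(t))
        for i in range(num_chunks):
            rows[i].extend(padded[i * chunk_len:(i + 1) * chunk_len])
    return rows


if __name__ == "__main__":
    # Two inputs of length 4 and 3, split into 2 chunks each.
    print(chunk_cat_reference([[1, 2, 3, 4], [5, 6, 7]], num_chunks=2))
```

The baseline does the padding, copy, and concatenation as separate steps; the fast path is intended to do the same work in a fused CUDA implementation.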
- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132
Unit: billion elements per second. Hardware: H100. Baseline: [existing FSDP implementation](https://github.com/pytorch/pytorch/blob/7b3febdca7ad90aaf64d5b959d65364dc28c7424/torch/distributed/_composable/fsdp/_fsdp_collectives.py#L176)
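To read the numbers above as speedups, the following short sketch divides the second figure by the first for each dtype pair. It assumes the value left of each arrow is the baseline throughput and the value on the right is the throughput with this PR; the dictionary keys are just labels chosen for this example.

```python
# Throughput in billion elements per second, copied from the list above.
# Assumption: (baseline, with this PR).
throughput = {
    "Float16 -> Float16": (249, 500),
    "BFloat16 -> BFloat16": (248, 500),
    "BFloat16 -> Float32": (126, 278),
    "Float32 -> Float32": (153, 260),
    "Float64 -> Float64": (79, 132),
    "int8 -> int8": (332, 908),
    "int16 -> int16": (250, 489),
    "int32 -> int32": (153, 260),
    "int64 -> int64": (79, 132),
}

# Speedup is the ratio of new throughput to baseline throughput.
speedups = {name: new / base for name, (base, new) in throughput.items()}

if __name__ == "__main__":
    for name, ratio in speedups.items():
        print(f"{name}: {ratio:.2f}x")
```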
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang