pytorch
aaff8d27 - CUDA fast path for `_chunk_cat()` (#120678)

189 days ago

This PR provides a CUDA fast path implementation for the ATen op `_chunk_cat` (#121081).

Performance on a production benchmark:

- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132

Unit: billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](https://github.com/pytorch/pytorch/blob/7b3febdca7ad90aaf64d5b959d65364dc28c7424/torch/distributed/_composable/fsdp/_fsdp_collectives.py#L176)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang
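
The benchmark figures are easier to compare as speedup ratios. The short sketch below only does that arithmetic. The throughput values are copied from the commit message (billion elements per second, baseline then fast path); the names `throughput`, `speedup`, and `speedups` are illustrative and do not come from the PR.

```python
# Throughput in billions of elements per second on H100, copied from the
# commit message. Keys are (input dtype, output dtype); values are
# (baseline FSDP implementation, CUDA fast path).
throughput = {
    ("Float16", "Float16"): (249, 500),
    ("BFloat16", "BFloat16"): (248, 500),
    ("BFloat16", "Float32"): (126, 278),
    ("Float32", "Float32"): (153, 260),
    ("Float64", "Float64"): (79, 132),
    ("int8", "int8"): (332, 908),
    ("int16", "int16"): (250, 489),
    ("int32", "int32"): (153, 260),
    ("int64", "int64"): (79, 132),
}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio of the fast path over the baseline."""
    return after / before


# Hypothetical helper table: one speedup ratio per dtype pair.
speedups = {dtypes: speedup(*values) for dtypes, values in throughput.items()}

for (src, dst), ratio in speedups.items():
    print(f"{src:>8} in, {dst:<8} out: {ratio:.2f}x")
```

By this arithmetic the reported gains range from roughly 1.7x for the 64-bit types to roughly 2.7x for int8.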

Author

BoyuanFeng

Committer

pytorchmergebot
