pytorch
320c3855 - Refactor CUDA copy and general copy dispatch (#20685)

Commit

5 years ago

Refactor CUDA copy and general copy dispatch (#20685)

Summary: Copy.cu goes from 308 to 190 lines of code. In general it uses the same copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy using temporary buffers. The pointwise kernel has slightly improved performance when broadcasting due to faster index calculation.

This deletes "`s_copy_`", "`_s_copy_from`", and "`_copy_same_type_`". The only entry-point now is "`copy_`".

A mini-benchmark is here: https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992
Before: https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After: https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc

Results were measured on 2.2 GHz Broadwell; no-turbo; one thread; compiled with GCC 7.3.0. (Results are slower than typical usage due to turbo being off.)

The only significant difference is in the CUDA [1024] -> [1024, 1024] broadcasting copy, which is ~25% faster. I don't expect a noticeable difference in real programs.

CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect anyone to notice that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685
Differential Revision: D15414819
Pulled By: colesbury
fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
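For readers unfamiliar with the three strategies named in the summary, here is a minimal pure-Python sketch of how a dispatcher might choose between them. It is not the code in Copy.cu: `TensorInfo`, `choose_copy_strategy`, the `peer_access` flag, and the exact conditions are all assumptions made for illustration. The message itself only states that the strategies are cudaMemcpyAsync, a pointwise kernel, or a copy using temporary buffers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical stand-in for the tensor metadata a copy needs."""
    device: str               # e.g. "cuda:0", "cuda:1", "cpu"
    dtype: str                # e.g. "float32"
    shape: Tuple[int, ...]
    contiguous: bool = True


def choose_copy_strategy(dst: TensorInfo, src: TensorInfo,
                         peer_access: bool = True) -> str:
    """Pick one of the three strategies from the summary.

    The conditions below are assumed for illustration only.
    """
    same_dtype = dst.dtype == src.dtype
    same_shape = dst.shape == src.shape  # False when src is broadcast
    both_contiguous = dst.contiguous and src.contiguous

    # Assumed: a plain byte-for-byte copy is enough when layouts match.
    if same_dtype and same_shape and both_contiguous:
        return "memcpy_async"

    # Assumed: a pointwise kernel handles dtype conversion, strides and
    # broadcasting when one GPU can address both tensors.
    both_cuda = (dst.device.startswith("cuda")
                 and src.device.startswith("cuda"))
    if both_cuda and (dst.device == src.device or peer_access):
        return "pointwise_kernel"

    # Assumed fallback: stage through contiguous temporary buffers.
    return "temporary_buffers"


# The broadcasting case from the benchmark: [1024] -> [1024, 1024] on one GPU.
dst = TensorInfo("cuda:0", "float32", (1024, 1024))
src = TensorInfo("cuda:0", "float32", (1024,))
print(choose_copy_strategy(dst, src))
```

Under these assumed rules the [1024] -> [1024, 1024] broadcasting copy lands on the pointwise kernel, which is the path the message reports as ~25% faster.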

Author

colesbury

Committer

facebook-github-bot

Parents

cf7ef5e6
