SemanticDiff

pytorch
9696f06b - Use __ldg for CUDA kernels in fuser (#18540)


Commit

5 years ago

Use __ldg for CUDA kernels in fuser (#18540)

Summary:
While benchmarking a kernel with broadcasted inputs, I noticed that it was much slower than a hand-coded kernel for the same task. The kernel in question computed a * b + c for a of shape 32 x 32 x 10240 and b and c of shape 1 x 32 x 1. This patch accelerates said kernel from 450us to 250us on my GTX 1080 Ti.

I didn't change half because there doesn't seem to be __ldg for half. An alternative could be to add const and __restrict__ qualifiers throughout the generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18540

Differential Revision: D14657840

Pulled By: soumith

fbshipit-source-id: 408847346ec12d1d1d9b119ac50bbc70f0d9ed33
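A minimal CUDA sketch of the broadcasted a * b + c kernel described in the message, assuming contiguous float32 tensors and a flat one-thread-per-element launch. The kernel names and the max_error helper are hypothetical; this is not the code the PyTorch fuser generates. It only contrasts an explicit __ldg load with the const/__restrict__ alternative the message mentions. __ldg needs compute capability 3.5 or newer.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical kernel: out = a * b + c, where a is contiguous with shape
// [n0, n1, n2] and b, c have shape [1, n1, 1] (broadcast over dims 0 and 2).
__global__ void mul_add_bcast_ldg(const float* a, const float* b,
                                  const float* c, float* out,
                                  int n1, int n2, long long numel) {
  long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  if (i >= numel) return;
  int j = (int)((i / n2) % n1);  // position along the broadcast dimension
  // __ldg loads through the read-only data cache. b[j] and c[j] are shared
  // by every thread with the same j, so repeated reads can hit that cache.
  out[i] = __ldg(&a[i]) * __ldg(&b[j]) + __ldg(&c[j]);
}

// The alternative from the commit message: const + __restrict__ tells the
// compiler the inputs are read-only and unaliased, so it may choose the same
// cached load path on its own (not guaranteed).
__global__ void mul_add_bcast_restrict(const float* __restrict__ a,
                                       const float* __restrict__ b,
                                       const float* __restrict__ c,
                                       float* __restrict__ out,
                                       int n1, int n2, long long numel) {
  long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  if (i >= numel) return;
  int j = (int)((i / n2) % n1);
  out[i] = a[i] * b[j] + c[j];
}

// Runs one variant and returns the max abs error against a CPU reference
// (INFINITY if a CUDA call or the launch fails, e.g. no GPU present).
float max_error(bool use_ldg, int n0, int n1, int n2) {
  assert(n0 > 0 && n1 > 0 && n2 > 0);
  long long numel = (long long)n0 * n1 * n2;
  std::vector<float> a(numel), b(n1), c(n1), out(numel);
  for (long long i = 0; i < numel; ++i) a[i] = (float)(i % 7);
  for (int j = 0; j < n1; ++j) { b[j] = 0.5f * j; c[j] = (float)j; }

  float *da = nullptr, *db = nullptr, *dc = nullptr, *dout = nullptr;
  size_t big = numel * sizeof(float), small = n1 * sizeof(float);
  bool ok = cudaMalloc(&da, big) == cudaSuccess &&
            cudaMalloc(&db, small) == cudaSuccess &&
            cudaMalloc(&dc, small) == cudaSuccess &&
            cudaMalloc(&dout, big) == cudaSuccess;
  if (ok) {
    cudaMemcpy(da, a.data(), big, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), small, cudaMemcpyHostToDevice);
    cudaMemcpy(dc, c.data(), small, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (int)((numel + threads - 1) / threads);
    if (use_ldg)
      mul_add_bcast_ldg<<<blocks, threads>>>(da, db, dc, dout, n1, n2, numel);
    else
      mul_add_bcast_restrict<<<blocks, threads>>>(da, db, dc, dout, n1, n2, numel);
    ok = cudaGetLastError() == cudaSuccess &&
         cudaMemcpy(out.data(), dout, big, cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(da); cudaFree(db); cudaFree(dc); cudaFree(dout);
  if (!ok) return INFINITY;

  float err = 0.0f;
  for (long long i = 0; i < numel; ++i) {
    int j = (int)((i / n2) % n1);
    err = std::max(err, std::fabs(out[i] - (a[i] * b[j] + c[j])));
  }
  return err;
}
```

With the shapes from the message, max_error(true, 32, 32, 10240), each of the 32 elements of b and c is read by 32 x 10240 = 327,680 threads, which is the reuse a read-only cache load can exploit. The sketch only checks correctness; measuring any speedup on a given GPU would need separate timing (for example with CUDA events).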

Author

t-vi

Committer

facebook-github-bot

Parents
