enable better depthwise conv perf on cudnn 8.2+ (#58749)
Summary:
There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since https://github.com/pytorch/pytorch/pull/22302.
This PR aims to harvest these improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.
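For context, the workload this PR targets is depthwise convolution, which PyTorch expresses as a grouped convolution with groups equal to the input channel count. A minimal sketch (sizes are illustrative, not taken from the benchmark):

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups == in_channels, so each input channel
# is convolved with its own single filter. This is the case that can
# dispatch to the specialized cuDNN depthwise kernels.
channels = 64
depthwise = nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                      groups=channels)

x = torch.randn(8, channels, 56, 56)
y = depthwise(x)
print(y.shape)        # same spatial size thanks to padding=2
print(depthwise.weight.shape)  # one 5x5 filter per channel
```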
Similar to https://github.com/pytorch/pytorch/pull/22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU launch-bound cases.
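The event-based timing idea can be sketched as follows (a minimal illustration under my reading of the approach, not the actual benchmark script; `time_conv` is a hypothetical helper name):

```python
import torch

def time_conv(conv, x, warmup=10, iters=100):
    # Warmup iterations both settle clocks/caches and fill the GPU
    # launch queue ahead of the timed region, so the measurement
    # reflects kernel time even when the CPU is launch bound.
    for _ in range(warmup):
        conv(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    # Events are recorded on the GPU timeline; synchronize before reading.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call
```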
Here are the A100 and V100 results, sorted by speedup:
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)
Result highlights:
Newly enabled 5x5 cuDNN kernels show up to 6x speedup.
Close to half of the test sizes show >10% speedup.
Fixed some corner cases that previously caused 15-20x slowdowns.
Only a handful of cases slow down (~10 out of >1000).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58749
Reviewed By: bdhirsh
Differential Revision: D31613199
Pulled By: ngimel
fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398