enable better depthwise conv perf on cudnn 8.2+ (#58749)
Summary:
There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since https://github.com/pytorch/pytorch/pull/22302.
This PR aims to harvest these improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.
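For context, the workload this PR targets is depthwise convolution, which PyTorch expresses as a grouped convolution with groups equal to the input channel count. A minimal sketch (sizes are illustrative, not taken from the benchmark):

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups == in_channels, so each input channel
# is convolved with its own single filter. This is the case that can
# dispatch to the specialized cuDNN depthwise kernels.
channels = 64
depthwise = nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                      groups=channels)

x = torch.randn(8, channels, 56, 56)
y = depthwise(x)
print(y.shape)        # same spatial size thanks to padding=2
print(depthwise.weight.shape)  # one 5x5 filter per channel
```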
Similar to https://github.com/pytorch/pytorch/pull/22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU launch-bound cases.
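The event-based timing idea can be sketched as follows (a minimal illustration under my reading of the approach, not the actual benchmark script; `time_conv` is a hypothetical helper name):

```python
import torch

def time_conv(conv, x, warmup=10, iters=100):
    # Warmup iterations both settle clocks/caches and fill the GPU
    # launch queue ahead of the timed region, so the measurement
    # reflects kernel time even when the CPU is launch bound.
    for _ in range(warmup):
        conv(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    # Events are recorded on the GPU timeline; synchronize before reading.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call
```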
Here are the A100 and V100 results, sorted by speedup:
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)
Result highlights:
Newly enabled 5x5 cuDNN kernels show up to 6x speedup.
Close to half of the test sizes show >10% speedup.
Fixed some corner cases that previously caused 15-20x slowdowns.
Only a handful of cases slow down (~10 out of >1000).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58749
Reviewed By: bdhirsh
Differential Revision: D31613199
Pulled By: ngimel
fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398