Performance improvements for depthwise convolutions in FP16 (#22302)
Summary:
This PR activates faster depthwise convolution kernels for Volta and Turing GPUs when cuDNN >= 7.6.0 (`CUDNN_VERSION >= 7600`) is available.
The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774).
(50 warmup iterations, 1000 iterations for timing)
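
For reference, here is a minimal sketch of that timing methodology (50 warmup iterations, 1000 timed iterations, CUDA events for measurement, forward pass only). The layer shapes and parameters below are illustrative placeholders, not the exact configurations from the linked gist:

```python
import torch
import torch.nn as nn

def bench_depthwise(batch, channels, spatial, kernel_size=3, warmup=50, iters=1000):
    # groups == channels makes this a depthwise convolution
    conv = nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2,
                     groups=channels,
                     bias=False).cuda().half()
    x = torch.randn(batch, channels, spatial, spatial,
                    device='cuda', dtype=torch.half)

    # warmup iterations so kernel selection and caches settle
    for _ in range(warmup):
        conv(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

# example invocation with placeholder shapes
print(bench_depthwise(batch=32, channels=128, spatial=56))
```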
I've used https://github.com/pytorch/pytorch/issues/3265 as the basis for a similar benchmark and added a few additional configurations.
Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing).
Times are given in ms per iteration.
We've benchmarked this PR on a DGX-1 using V100 GPUs.
The current workload check in `check_cudnn_depthwise_workload` is quite long and could be moved to another file if desired.
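
For context, a schematic Python rendering of the kind of gating such a check performs; the actual function lives in ATen's C++ convolution dispatch, and the thresholds below are placeholder assumptions rather than the tuned values from this PR:

```python
import torch

def use_cudnn_depthwise(x, weight):
    """Schematic sketch: decide whether to route a depthwise conv to cuDNN."""
    is_half = x.dtype == torch.half
    # grouped conv weight has shape (out_channels, in_channels // groups, kH, kW);
    # one input channel per group means depthwise
    is_depthwise = weight.shape[1] == 1
    cap = torch.cuda.get_device_capability(x.device)
    is_volta_or_turing = cap[0] == 7            # sm_70 / sm_75
    cudnn_ok = torch.backends.cudnn.version() >= 7600
    # The real heuristic additionally whitelists benchmarked combinations of
    # batch size, channel count, spatial size, and stride; this threshold is
    # a made-up stand-in for that per-workload table.
    large_enough = x.shape[0] * x.shape[1] >= 1024
    return (is_half and is_depthwise and is_volta_or_turing
            and cudnn_ok and large_enough)
```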
CC ngimel (Thanks for the support while benchmarking it ;) )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22302
Differential Revision: D16115057
Pulled By: ezyang
fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de