[cuDNN v8 API] cuDNN benchmark, convolution bwd / transposed convolution fwd, `bfloat16`, conv-bias-activation fusion (#60755)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/58414, https://github.com/pytorch/pytorch/issues/58859, https://github.com/pytorch/pytorch/issues/58858, https://github.com/pytorch/pytorch/issues/58860, and https://github.com/pytorch/pytorch/issues/58861
We are currently testing performance with both the "find" and "get" heuristics as part of this PR.
CC zasdfgbnm ptrblck ngimel puririshi98
In addition to the `USE_EXPERIMENTAL_CUDNN_V8_API` build flag, we've added a `CUDNN_V8_API_ENABLED` runtime feature flag.
`USE_EXPERIMENTAL_CUDNN_V8_API=1` will build with v8 API support while keeping all v7 functionality, with v8 usage disabled by default.
`CUDNN_V8_API_ENABLED=1` at runtime on a `USE_EXPERIMENTAL_CUDNN_V8_API=1` build uses the v8 API.
A debug flag `CUDNN_V8_API_DEBUG=1` can be used to verify which API is used when dispatching convolutions.
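As a sketch of the flags described above (the script names here are illustrative; `setup.py` is the standard PyTorch source build entry point, but your build invocation may differ):

```shell
# Build with cuDNN v8 API support compiled in; v7 remains the
# default dispatch path until opted in at runtime.
USE_EXPERIMENTAL_CUDNN_V8_API=1 python setup.py develop

# On such a build, opt in to v8 dispatch at runtime, with the debug
# flag enabled to verify which API handles each convolution.
# (my_training_script.py is a placeholder for your own workload.)
CUDNN_V8_API_ENABLED=1 CUDNN_V8_API_DEBUG=1 python my_training_script.py
```

Without `CUDNN_V8_API_ENABLED=1`, the same binary keeps using the v7 code path, so the two can be compared on identical builds.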
Note that with the v7 API, `bfloat16` convolutions dispatch to a native PyTorch implementation, whereas a fully v8-enabled build dispatches them to cuDNN implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60755
Reviewed By: mruberry
Differential Revision: D34393940
Pulled By: ngimel
fbshipit-source-id: 5c317d3aad63336ea416a51a43cf8b7d27aaca21
(cherry picked from commit 3bfc549ce57cee691f83dc894ac7adb4b7882459)