Use a pool of per-thread cuDNN handles for each device, updated (#15080)
Summary:
Rebased version of https://github.com/pytorch/pytorch/pull/14861 that aims to address ezyang's review comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15080
Differential Revision: D13440858
Pulled By: ezyang
fbshipit-source-id: 1c6af5c53538b81c6b92cf1dda231ed333f28035