[RELAND] [cuDNN] Add a new optimized cuDNN RNN algorithm for small RNN hidden_size (#73211)
Summary:
https://github.com/pytorch/pytorch/pull/62143 was reverted (https://github.com/pytorch/pytorch/pull/72089) because, when native tests were run internally with cuDNN on GPUs where `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` was selected, we hit `CUDNN_STATUS_NOT_SUPPORTED` errors.
Based on https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions and experiments, I strongly suspect the errors were because `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` doesn't support variable sequence lengths in the batch.
This PR restores https://github.com/pytorch/pytorch/pull/62143 and adds a bailout condition if the input is a packed batch that might have different sequence lengths per element.
Question for review: Do we also need to add a bailout condition if the input is double precision?
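To illustrate the bailout logic described above, here is a minimal Python sketch of the dispatch heuristic. This is not the actual implementation (which lives in PyTorch's C++ cuDNN bindings), and the function name and the `SMALL_H_LIMIT` threshold are hypothetical; it only shows the shape of the check: prefer the small-hidden-size persistent algorithm, but fall back whenever the batch is packed with non-uniform sequence lengths.

```python
def use_persist_static_small_h(hidden_size: int,
                               seq_lengths: list[int],
                               is_packed: bool) -> bool:
    """Hypothetical eligibility check for CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H.

    Illustrative only -- the real selection logic and thresholds are in
    PyTorch's C++ cuDNN RNN code, not here.
    """
    SMALL_H_LIMIT = 128  # assumed cutoff for a "small" hidden_size
    if hidden_size > SMALL_H_LIMIT:
        return False
    # Bailout added by this PR: the SMALL_H algorithm does not support
    # variable sequence lengths, so a packed batch whose elements may
    # have different lengths must use the standard algorithm instead.
    if is_packed and len(set(seq_lengths)) > 1:
        return False
    return True
```

For example, a packed batch with lengths `[5, 3, 2]` would bail out to the standard algorithm, while a packed batch of uniform lengths `[5, 5, 5]` would remain eligible.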
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73211
Reviewed By: ejguan
Differential Revision: D34688016
Pulled By: ngimel
fbshipit-source-id: e7335c4701dabc7d0b36ebdb6414c4353a71ee91
(cherry picked from commit b9023bfd1c31eb9a38bf0552a20412e9a4e60b91)