[cuDNN] Add a new optimized cuDNN RNN algorithm for small RNN hidden_size (#62143)
Summary:
This PR enables a new cuDNN RNN/LSTM algorithm, `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H`, when `hidden_size` is small. Operator benchmarks show a 10x performance improvement for some shapes.
- [X] forward https://github.com/xwang233/code-snippet/tree/master/cudnn-rnn-bench-62143/forward
- [X] backward https://github.com/xwang233/code-snippet/tree/master/cudnn-rnn-bench-62143/backward
- [X] end-to-end model: benchmark looks good
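
For a quick local check, a minimal timing sketch along these lines may help. It is not the benchmark script linked above; the helper name `time_lstm` and all shapes are hypothetical, and which `hidden_size` values actually select the new algorithm is not stated here, so the sizes below are only illustrative guesses.

```python
import time

import torch


def time_lstm(hidden_size, input_size=32, seq_len=64, batch=16, iters=20):
    """Return the average forward time (seconds) of an nn.LSTM call.

    Hypothetical helper: shapes and iteration counts are illustrative only.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    lstm = torch.nn.LSTM(input_size, hidden_size).to(device)
    # Default layout is (seq_len, batch, input_size) since batch_first=False.
    x = torch.randn(seq_len, batch, input_size, device=device)

    with torch.no_grad():
        # Warm-up iterations so one-time setup cost is not measured.
        for _ in range(3):
            lstm(x)
        if device == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            lstm(x)
        # CUDA kernels run asynchronously; wait before stopping the clock.
        if device == "cuda":
            torch.cuda.synchronize()

    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # Assumed sizes for comparison; the actual threshold is not given here.
    for h in (8, 16, 32, 256):
        print(f"hidden_size={h}: {time_lstm(h) * 1e3:.3f} ms/iter")
```

Running the script on builds before and after this change, on a CUDA device with cuDNN enabled, gives a rough comparison; on a CPU-only machine it still runs but does not exercise cuDNN.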
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62143
Reviewed By: anjali411
Differential Revision: D33771442
Pulled By: ngimel
fbshipit-source-id: 0640abc6b90ebd2428c3182ce03bf0b9c30a2ec9
(cherry picked from commit 73b153a528fb9b64b994c1174882bc2f64b1ed47)