Enable oneDNN implementation in LSTM op (#91158)
### Description
This PR is to enable oneDNN implementation in LSTM op to improve the performance of it. Both FP32 and BF16 are supported.
### Performance improvement
In CPX 28C, with setting iomp and jemalloc.
We choose 8 LSTM input options (including input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, seq_len), and the final option is a real input from train-clean-100 in LibriSpeech dataset. The performance improvements are shown in the following figures. We can see that LSTM with oneDNN implementation can perform better than the original.
In single socket:
![image](https://user-images.githubusercontent.com/61222868/211182994-833debec-518a-4b35-8504-6b0fadb17930.png)
![image](https://user-images.githubusercontent.com/61222868/211183012-31e1253f-2c60-4c92-a656-c239a971b453.png)
In single core:
![image](https://user-images.githubusercontent.com/61222868/211183017-186e5d47-cb9a-4c1e-914f-fa718e769f1c.png)
![image](https://user-images.githubusercontent.com/61222868/211183022-53266857-5a9e-4a95-b300-33fa34811d08.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91158
Approved by: https://github.com/jgong5, https://github.com/malfet