Enable oneDNN implementation in LSTM op (#91158)
### Description
This PR is to enable oneDNN implementation in LSTM op to improve the performance of it. Both FP32 and BF16 are supported.
### Performance improvement
In CPX 28C, with setting iomp and jemalloc.
We choose 8 LSTM input options (including input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, seq_len), and the final option is a real input from train-clean-100 in LibriSpeech dataset. The performance improvements are shown in the following figures. We can see that LSTM with oneDNN implementation can perform better than the original.
In single socket:


In single core:


Pull Request resolved: https://github.com/pytorch/pytorch/pull/91158
Approved by: https://github.com/jgong5, https://github.com/malfet