Split rnn primitive for inference and training (#96736)
## Description
Currently, both inference and training use `forward_training` in the rnn primitive, which degrades inference performance (the slowdown comes from the rnn primitive itself and from the unnecessary creation of `pd` and `workspace`). This PR splits the primitive into separate `forward_inference` and `forward_training` paths.
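
The idea can be illustrated with a minimal, self-contained Python sketch. It is not the code in this PR: `PropKind`, `RnnPlan`, and `plan_rnn_forward` are hypothetical names, and the sketch only models the decision of which propagation kind to use and whether a workspace is needed, assuming the workspace is required only to pass state to the backward pass.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PropKind(Enum):
    """Hypothetical stand-in for the primitive's propagation kind."""
    FORWARD_INFERENCE = "forward_inference"
    FORWARD_TRAINING = "forward_training"


@dataclass
class RnnPlan:
    """Hypothetical description of how one forward call is set up."""
    prop_kind: PropKind
    workspace_bytes: Optional[int]  # None means no workspace is allocated


def plan_rnn_forward(train: bool, workspace_bytes: int) -> RnnPlan:
    """Choose the propagation kind for a forward call.

    Assumption: only the training path needs a workspace, because the
    workspace exists to carry intermediate state to the backward pass.
    """
    if train:
        return RnnPlan(PropKind.FORWARD_TRAINING, workspace_bytes)
    # Inference never runs backward, so skip the workspace entirely.
    return RnnPlan(PropKind.FORWARD_INFERENCE, None)


if __name__ == "__main__":
    print(plan_rnn_forward(train=False, workspace_bytes=1024))
    print(plan_rnn_forward(train=True, workspace_bytes=1024))
```

Before the split, the equivalent of `plan_rnn_forward` would have returned the training plan in both cases, so inference paid for a workspace it never used.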
## Performance
With this PR, RNN-T inference time is reduced by 167 ms, which is about `3.7%` of end-to-end (E2E) time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96736
Approved by: https://github.com/jgong5