[Vulkan] Optimize LSTM operator with pre-packing (#79702)
Summary:
Optimized LSTM operator by using pre-packing for weights and biases in the Vulkan GPU backend
- The weights and biases are always on the CPU side by design.
- The packed and unpacked data are stored in a VulkanOpContext.
- Ops:
- `at::native::vulkan::ops::create_lstm_context`: Creates a VulkanOpContext object with the packed and unpacked data, and returns a pointer to it.
  - `at::native::vulkan::ops::run_lstm_context`: Takes the three Vulkan input tensors (input sequence, initial hidden state, and initial cell state) and a pointer to the context, and runs the LSTM operation.
- Registered the ops in [Register.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/ops/Register.cpp).
- Rewrote the LSTM subgraph pattern in [vulkan_rewrite.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/passes/vulkan_rewrite.cpp) so that `create_lstm_context` and `run_lstm_context` are executed instead on the Vulkan GPU backend.
- Added a new test for the LSTM pre-packing and run ops: `lstm_prepack_success`
Test Plan: buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac
Reviewed By: SS-JIA
Differential Revision: D37052597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79702
Approved by: https://github.com/SS-JIA