[inductor][fx passes]batch linear in pre grad (#107759)
Summary:
After compiling the dense arch, we observe a split-linear-cat pattern. Hence, we want to use bmm fusion plus the split-cat pass to fuse the pattern into torch.baddbmm.
Some explanation of why we prefer pre grad:
1) We need to run the bmm fusion before the split-cat pass (which lives in the pre grad passes), so that the newly added stack/unbind nodes can be removed together with the original cat/split nodes.
2) The post grad pass does not support torch.stack/unbind. There is a hacky workaround, but it may not land in the short term.
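A minimal sketch of the equivalence this pass exploits (not the actual fx pass; shapes and layer count are hypothetical): N independent linears applied to N splits of the input can be rewritten as one stack + torch.baddbmm + unbind, which the split-cat pass can then simplify against the surrounding split/cat nodes.

```python
import torch

torch.manual_seed(0)
B, D_in, D_out, N = 4, 8, 16, 3  # hypothetical batch/feature sizes

x = torch.randn(B, N * D_in)
linears = [torch.nn.Linear(D_in, D_out) for _ in range(N)]

# Original pattern: split -> linear (xN) -> cat
splits = torch.split(x, D_in, dim=1)
ref = torch.cat([lin(s) for lin, s in zip(linears, splits)], dim=1)

# Fused pattern: stack inputs and weights, one batched addmm, then unbind + cat
xs = torch.stack(splits, dim=0)                                  # (N, B, D_in)
w = torch.stack([lin.weight.t() for lin in linears], dim=0)      # (N, D_in, D_out)
b = torch.stack([lin.bias for lin in linears], dim=0).unsqueeze(1)  # (N, 1, D_out)
fused = torch.baddbmm(b, xs, w)                                  # bias + batched matmul
out = torch.cat(torch.unbind(fused, dim=0), dim=1)

assert torch.allclose(ref, out, atol=1e-5)
```

The stack/unbind nodes introduced here are exactly what point 1) refers to: running this fusion before the split-cat pass lets that pass cancel them against the original split/cat nodes.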
Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (f0ff3e3fc)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms
=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
TODO: add e2e test
Differential Revision: D48539721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang