[Inductor][FX passes] New group/batch fusion pattern searching algorithm + group mm fusion + preserve memory layout (#106279)
Summary:
Major changes:
* Implement a new group/batch fusion pattern searching algorithm: only fuse patterns whose depths in the graph differ by at most a fixed bound, so that fusion stays local.
* Search the FX graph in reverse order, since most ops have more inputs than outputs.
* Support fusing mm (linear backward).
* Preserve memory layout for fbgemm.gmm.
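
The depth-bounded, reverse-order search described above can be sketched roughly as follows. This is a toy illustration in plain Python, not the code from this PR: `Node`, `compute_depths`, `ancestors`, `find_fusion_groups`, and the `max_depth_diff` parameter are all hypothetical names, and the graph is a topologically ordered list rather than a real `torch.fx.Graph`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for an FX node: a name, an op kind, and producer names.
    name: str
    op: str
    inputs: list = field(default_factory=list)


def compute_depths(nodes):
    """Depth = longest path from a graph input; `nodes` must be in topological order."""
    depth = {}
    for n in nodes:
        depth[n.name] = 1 + max((depth[i] for i in n.inputs), default=-1)
    return depth


def ancestors(name, by_name):
    """All transitive producers of `name`."""
    seen, stack = set(), list(by_name[name].inputs)
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(by_name[cur].inputs)
    return seen


def find_fusion_groups(nodes, op="mm", max_depth_diff=1, min_group_size=2):
    """Walk the graph in reverse and group independent `op` nodes of similar depth."""
    by_name = {n.name: n for n in nodes}
    depth = compute_depths(nodes)
    visited, groups = set(), []
    rev = list(reversed(nodes))
    for i, anchor in enumerate(rev):
        if anchor.op != op or anchor.name in visited:
            continue
        group = [anchor]
        # Nodes that a group member depends on cannot join the same group.
        blocked = ancestors(anchor.name, by_name)
        for cand in rev[i + 1:]:
            if cand.op != op or cand.name in visited:
                continue
            # Locality constraint: skip candidates too far away in depth.
            if abs(depth[anchor.name] - depth[cand.name]) > max_depth_diff:
                continue
            if cand.name in blocked:
                continue
            group.append(cand)
            blocked |= ancestors(cand.name, by_name)
        if len(group) >= min_group_size:
            visited.update(n.name for n in group)
            groups.append([n.name for n in group])
    return groups


# Toy graph: a = mm(x, w1), b = mm(x, w2), r = relu(a), c = mm(r, w3)
graph = [
    Node("x", "input"), Node("w1", "input"), Node("w2", "input"), Node("w3", "input"),
    Node("a", "mm", ["x", "w1"]),
    Node("b", "mm", ["x", "w2"]),
    Node("r", "relu", ["a"]),
    Node("c", "mm", ["r", "w3"]),
]
print(find_fusion_groups(graph, max_depth_diff=1))  # [['b', 'a']]
```

In this toy graph, `c` is two levels deeper than `a` and `b`, so with `max_depth_diff=1` only `a` and `b` are grouped. Widening the window to 2 lets `c` pair with `b`, while `a` is excluded because `c` depends on it.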
We tested on Ads models and saw consistent gains.
Test Plan: Unit tests and an integration test.
Differential Revision: D47581710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007