[inductor] decompose memory bound mm (#120047)
Summary:
Decompose memory bound mm/bmm.
Linear decomposition result: D53502768
BMM decomposition result: D53148650
We should only decompose when:
1) bmm: b is large and m, n, k are relatively small.
2) mm/addmm: m is large and n and k are relatively small. For example, the mm that computes the input gradient in linear backward should not be decomposed, since m is small and n is large.
More experiments are needed to find a better decomposition strategy. I tried a linear regression model (see the bento results), but it does not fit well. For the short term, we use heuristics to decide when to decompose.
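The two conditions above can be sketched as shape-only predicates. This is a minimal illustration, not the code in this diff: the function names and the `LARGE_DIM` / `SMALL_DIM` thresholds are hypothetical, since the summary does not state the cutoffs the pass uses.

```python
# Hypothetical thresholds, chosen only for illustration; the cutoffs used by
# the actual pass are not stated in this summary.
LARGE_DIM = 10240
SMALL_DIM = 32


def should_decompose_bmm(b: int, m: int, n: int, k: int) -> bool:
    """bmm of (b, m, k) x (b, k, n): decompose when b is large and m, n, k are small."""
    return b >= LARGE_DIM and max(m, n, k) <= SMALL_DIM


def should_decompose_mm(m: int, n: int, k: int) -> bool:
    """mm/addmm of (m, k) x (k, n): decompose when m is large and n, k are small."""
    return m >= LARGE_DIM and max(n, k) <= SMALL_DIM
```

Under these assumed thresholds, `should_decompose_mm(16, 20480, 16)` returns `False`, which matches the input gradient case in linear backward (small m, large n), while `should_decompose_mm(20480, 16, 16)` returns `True`.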
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```
COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40
FIRST AFOC pyper mc1
Differential Revision: D53602514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125