[Inductor] Support vectorized transpose in CPP backend (#91532)
Fix https://github.com/pytorch/torchdynamo/issues/1915
This PR adds vectorization support for transposed operations in the TorchInductor CPP backend. It contains the following changes:
1. `CppTile2DKernelChecker` is added to check whether the optimization can be applied. We only address a narrow set of situations; all of the following conditions must be met: 1) There is exactly one fp32 load/store whose buffer access is contiguous in an outer loop var. 2) Any load/store that is not contiguous in an outer loop var must be vectorizable along the inner-most dim. 3) There is no reduction. More scenarios/operations will be supported in future PRs.
2. If `CppTile2DKernelChecker` reports that the optimization is applicable, `CppKernelProxy` splits/tiles the loops along both the outer loop var that has contiguous buffer access and the inner-most loop var.
3. The main loop split from the outer loop var is further split at the inner-most level and then handled by `CppTile2DKernel` and `CppTile2DTailKernel`, which generate the transposed load/store. The former does the vectorized transposed load/store on tiles and then does vectorized load/store/compute along the inner-most loop axis; its vectorized transpose micro-kernel is based on the one in FBGEMM. The latter simply does scalar operations.
4. The tail loop split from the outer loop var directly calls `CppKernel` with scalar operations.
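The loop structure described in items 2-4 can be sketched in plain Python. This is only an illustration of the tiling scheme, not the code Inductor generates: the function name `add_transposed`, the tile size of 4, and the example computation `out[i][j] = a_t[j][i] + b[i][j]` are all hypothetical, and list slices stand in for SIMD vector loads/stores.

```python
def add_transposed(a_t, b, tile=4):
    """Compute out[i][j] = a_t[j][i] + b[i][j] with 2D loop tiling.

    Hypothetical example: a_t is contiguous along the outer loop var i,
    while b and out are contiguous along the inner-most loop var j.
    """
    m, n = len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    m_main = m - m % tile  # extent covered by the outer main loop
    n_main = n - n % tile  # extent covered by the inner main loop

    # Main loop split from the outer loop var.
    for i0 in range(0, m_main, tile):
        # Inner main loop: tile-by-tile work (the role of CppTile2DKernel).
        for j0 in range(0, n_main, tile):
            # "Vector" loads that are contiguous along i.
            rows = [a_t[j0 + k][i0:i0 + tile] for k in range(tile)]
            # Transpose the tile so each row is contiguous along j:
            # tile_t[ii][jj] == a_t[j0 + jj][i0 + ii]
            tile_t = [list(col) for col in zip(*rows)]
            # "Vector" load/compute/store along the inner-most axis.
            for ii in range(tile):
                vec_b = b[i0 + ii][j0:j0 + tile]
                out[i0 + ii][j0:j0 + tile] = [
                    x + y for x, y in zip(tile_t[ii], vec_b)
                ]
        # Inner tail: scalar operations (the role of CppTile2DTailKernel).
        for i in range(i0, i0 + tile):
            for j in range(n_main, n):
                out[i][j] = a_t[j][i] + b[i][j]

    # Tail loop split from the outer loop var: scalar operations only
    # (the role of the plain CppKernel).
    for i in range(m_main, m):
        for j in range(n):
            out[i][j] = a_t[j][i] + b[i][j]
    return out
```

Calling `add_transposed(a_t, b)` with a 6x7 `b` and a 7x6 `a_t` exercises all three regions: one 4x4 tile handled by the tiled path, a 4x3 inner tail, and a 2x7 outer tail.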
Next steps:
1. Support vectorized transpose with a smaller tile size in one dim and a larger tile size in the other, e.g., 3x784.
2. Support reductions vectorized on the outer loop var (contiguous along the outer loop var rather than the inner-most loop var).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91532
Approved by: https://github.com/EikanWang, https://github.com/jansel