Adapt memory optimizer to fit PHI2 (#19757)
### Adapt memory optimizer to fit PHI2
A few improvements and bug fixes:
1. Fix a bug in transformer layer detection.
2. Use the default reversed topological order when creating recompute
nodes, so that leaf nodes are not handled too late and left with the
lowest execution priority.
3. Add an early stop when an activation's element count is constant and
the total element count is < 1M. This avoids the overhead of searching
subgraphs.
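
The early-stop condition in item 3 can be sketched roughly as below. This is
only an illustration, not the optimizer's actual code: the function name,
the threshold constant, and the shape representation (ints for constant
dimensions, strings for symbolic ones) are all hypothetical, and "1M" is
assumed to mean 1,000,000 elements.

```python
from math import prod
from typing import Sequence, Union

# Assumption: "1M" is read as 1,000,000 elements.
SMALL_ACTIVATION_ELEMENT_LIMIT = 1_000_000

# Hypothetical shape encoding: int = constant dim, str = symbolic dim.
Dim = Union[int, str]


def can_skip_subgraph_search(shape: Sequence[Dim]) -> bool:
    """Return True when the activation is constant-sized and small."""
    if not all(isinstance(d, int) for d in shape):
        # A symbolic dimension means the element count is not constant.
        return False
    return prod(shape) < SMALL_ACTIVATION_ELEMENT_LIMIT
```

Under these assumptions, a [8, 128, 64] activation is skipped, while one
with a symbolic batch dimension still goes through the subgraph search.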
Setting ORTMODULE_MEMORY_OPT_LEVEL=1 (via export) enables layerwise
recompute; on the given recipe, memory consumption dropped from ~22GB to
~13GB.
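
For launchers where a shell export is inconvenient, the same variable can
be set from Python. A minimal sketch, assuming the variable is read when the
training model is set up, so it has to be set before that point:

```python
import os

# Equivalent to `export ORTMODULE_MEMORY_OPT_LEVEL=1` in the launching
# shell. Assumption: this runs before the model is wrapped for training.
os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = "1"
```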