Fix the performance issue that the for-loop before ExternallCall could not be parallelized. (#85056) (#86516)
Currently, NNC only parallelizes the loop statement of the graph outputs. The logic could bypass some loop statements that could be parallelized. Take an example as follows and suppose the output of `ExternallCall` is also the output of NNC fusion group. Current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallel the `ExternalCall` and bypass `stmt1` and `stmt2`.
```c++
stmt1: For:
stmt2: For:
stmt3: ExternalCall
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85056
Approved by: https://github.com/frank-wei, https://github.com/bertmaher