Add ReduceMin, ReduceMax nodes before the first consumer of a tensor (#26768)
Significantly reduces peak memory usage during MinMax calibration.
## Description
During MinMax calibration, the ReduceMin and ReduceMax nodes were appended to the end of the node list. Because these nodes have no consumers of their own, they ended up last in the topological execution order, so every intermediate tensor stayed alive, occupying memory, until its ReduceMin and ReduceMax nodes finally consumed it. This PR reorders the node list so that, in topological order, the ReduceMin and ReduceMax nodes execute before the tensor's original first consumer. The tensor's memory can then be freed as soon as that original first consumer has consumed it, as sketched below.
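The following is a minimal sketch of the idea, not the actual PR code: it inserts the ReduceMin/ReduceMax nodes for a given tensor immediately before that tensor's first consumer in the node list, instead of appending them at the end. The function name `insert_reduce_before_first_consumer` and the output-naming scheme are illustrative assumptions, not identifiers from the PR.

```python
# Illustrative sketch only: reorder calibration reduce nodes so they run
# before the tensor's original first consumer (names are hypothetical).
import onnx
from onnx import helper, TensorProto


def insert_reduce_before_first_consumer(model: onnx.ModelProto, tensor_name: str) -> onnx.ModelProto:
    graph = model.graph
    nodes = list(graph.node)

    # Index of the first node (in the existing topological order) that
    # consumes the tensor; fall back to the end if it has no consumer.
    first_consumer_idx = next(
        (i for i, n in enumerate(nodes) if tensor_name in n.input), len(nodes)
    )

    # Reduce over all axes to a scalar min/max for calibration.
    reduce_min = helper.make_node(
        "ReduceMin", [tensor_name], [tensor_name + "_ReduceMin"],
        keepdims=0, name=tensor_name + "_ReduceMin",
    )
    reduce_max = helper.make_node(
        "ReduceMax", [tensor_name], [tensor_name + "_ReduceMax"],
        keepdims=0, name=tensor_name + "_ReduceMax",
    )

    # Insert before the first consumer instead of appending at the end, so the
    # tensor can be released as soon as its original consumer has run.
    nodes[first_consumer_idx:first_consumer_idx] = [reduce_min, reduce_max]

    del graph.node[:]
    graph.node.extend(nodes)

    # Expose the reduced min/max as graph outputs so calibration can fetch them.
    graph.output.extend([
        helper.make_tensor_value_info(tensor_name + "_ReduceMin", TensorProto.FLOAT, []),
        helper.make_tensor_value_info(tensor_name + "_ReduceMax", TensorProto.FLOAT, []),
    ])
    return model
```

Because the tensor's producer necessarily precedes its first consumer, inserting the reduce nodes at that position keeps the graph topologically valid while moving their execution as early as possible.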
## Motivation and Context
During MinMax calibration of larger LLMs such as Phi-4 14B, even an 80 GB A100 GPU was not sufficient: calibration always failed with a CUDA out-of-memory error before the first inference completed. This PR addresses that issue by significantly reducing the peak memory required during MinMax calibration.
Co-authored-by: Ronak Mahawar <rmahawar@qti.qualcomm.com>