Add ReduceMin, ReduceMax nodes before the first consumer of a tensor (#26768)
Significantly reduces peak memory usage during MinMax calibration.
## Description
During MinMax calibration, the ReduceMin and ReduceMax nodes were appended to the end of the node list. Because these nodes have no consumers of their own, they ended up last in the topological execution order, so every intermediate tensor stayed alive, occupying memory, until its ReduceMin and ReduceMax nodes finally consumed it. This PR reorders the node list so that, in topological order, the ReduceMin and ReduceMax nodes execute before the tensor's original first consumer. The tensor's memory can then be freed as soon as that original first consumer has consumed it, as sketched below.
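The following is a minimal sketch of the idea, not the actual PR code: it inserts the ReduceMin/ReduceMax nodes for a given tensor immediately before that tensor's first consumer in the node list, instead of appending them at the end. The function name `insert_reduce_before_first_consumer` and the output-naming scheme are illustrative assumptions, not identifiers from the PR.

```python
# Illustrative sketch only: reorder calibration reduce nodes so they run
# before the tensor's original first consumer (names are hypothetical).
import onnx
from onnx import helper, TensorProto


def insert_reduce_before_first_consumer(model: onnx.ModelProto, tensor_name: str) -> onnx.ModelProto:
    graph = model.graph
    nodes = list(graph.node)

    # Index of the first node (in the existing topological order) that
    # consumes the tensor; fall back to the end if it has no consumer.
    first_consumer_idx = next(
        (i for i, n in enumerate(nodes) if tensor_name in n.input), len(nodes)
    )

    # Reduce over all axes to a scalar min/max for calibration.
    reduce_min = helper.make_node(
        "ReduceMin", [tensor_name], [tensor_name + "_ReduceMin"],
        keepdims=0, name=tensor_name + "_ReduceMin",
    )
    reduce_max = helper.make_node(
        "ReduceMax", [tensor_name], [tensor_name + "_ReduceMax"],
        keepdims=0, name=tensor_name + "_ReduceMax",
    )

    # Insert before the first consumer instead of appending at the end, so the
    # tensor can be released as soon as its original consumer has run.
    nodes[first_consumer_idx:first_consumer_idx] = [reduce_min, reduce_max]

    del graph.node[:]
    graph.node.extend(nodes)

    # Expose the reduced min/max as graph outputs so calibration can fetch them.
    graph.output.extend([
        helper.make_tensor_value_info(tensor_name + "_ReduceMin", TensorProto.FLOAT, []),
        helper.make_tensor_value_info(tensor_name + "_ReduceMax", TensorProto.FLOAT, []),
    ])
    return model
```

Because the tensor's producer necessarily precedes its first consumer, inserting the reduce nodes at that position keeps the graph topologically valid while moving their execution as early as possible.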
## Motivation and Context
During MinMax calibration of larger LLMs such as Phi-4 14B, even an 80 GB A100 GPU was not sufficient: calibration always failed with a CUDA out-of-memory error before the first inference completed. This PR addresses that issue by significantly reducing the peak memory required during MinMax calibration.
Co-authored-by: Ronak Mahawar <rmahawar@qti.qualcomm.com>