cuda graph enhancement (#19636)

Commit

2 years ago

cuda graph enhancement (#19636)

### Description

1. Add a config key in run_options to control CUDA graph at runtime.
2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT session.
3. Provide a model modification/inference example on Phi-2.
4. Benchmarks show an average 13% latency reduction in token generation.

Limitation: the TRT EP and ROCm EP have not applied this feature yet. We can revisit this in the future.
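To illustrate the idea of saving and retrieving multiple graphs in one session, here is a minimal pure-Python sketch. It is not the ONNX Runtime implementation: `GraphCache`, `run`, the `graph_id` parameter, and the use of `-1` as a "skip capture" sentinel are all hypothetical names and conventions chosen for this example. The sketch only models the control flow: the first run under a given id records the work, and later runs under the same id replay the recording.

```python
from typing import Callable, Dict, List, Sequence

Op = Callable[[], None]


class GraphCache:
    """Toy per-session store of captured graphs, keyed by an integer id.

    Hypothetical illustration only; real CUDA graph capture records GPU
    work on a stream rather than a list of Python callables.
    """

    SKIP = -1  # assumed sentinel: run eagerly, neither capture nor replay

    def __init__(self) -> None:
        self._graphs: Dict[int, List[Op]] = {}
        self.capture_count = 0

    def run(self, graph_id: int, ops: Sequence[Op]) -> str:
        if graph_id == self.SKIP:
            # Caller asked to bypass graph handling for this run.
            for op in ops:
                op()
            return "eager"

        if graph_id not in self._graphs:
            # First run under this id: execute the ops and record them.
            recorded = list(ops)
            for op in recorded:
                op()
            self._graphs[graph_id] = recorded
            self.capture_count += 1
            return "captured"

        # Later runs under the same id: replay the stored recording.
        for op in self._graphs[graph_id]:
            op()
        return "replayed"
```

A caller would pick a distinct id per input shape or decoding phase (for example, one id for prompt processing and one for token generation), so that each phase keeps its own recording within the same session.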

References

wangye/eps

#19636 - cuda graph enhancement

Author

gh-yewang

Parents

bff4f8bf

onnxruntime 72ce4de0 - cuda graph enhancement (#19636)