onnxruntime
a7178fd8 - Move buffer release or cache from OnRefresh to ReleaseBuffer in BucketCacheManager (#25276)

Commit
163 days ago
### Description

This PR moves the buffer release-or-cache step from OnRefresh to ReleaseBuffer in BucketCacheManager.

### Motivation and Context

OnRefresh is executed only after a batch of 16 EP runs. Within a batch, released buffers cannot actually be reused, which wastes GPU buffer resources. This PR proposes a straightforward optimization: release or cache the buffer early, in ReleaseBuffer instead of OnRefresh. Buffers are then cached or released more efficiently, which improves both peak and average GPU memory usage. The experimental results show a reasonable memory reduction without perf regressions.

In the tables below, percentages are relative to the Default Bucket baseline. A positive value is an improvement (lower memory or latency, higher throughput) and a negative value is a regression.

#### Phi3

Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec
-- | -- | -- | -- | --
Default Bucket | 3603.83 | 3127.05 | 7.17 | 139.50
Default Bucket with Early Release Optimization | 3534.77 (+1.92%) | 3073.97 (+1.70%) | 7.14 (+0.36%) | 140.01 (+0.36%)

#### Deepseek-R1

Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec
-- | -- | -- | -- | --
Default Bucket | 2089.03 | 1716.15 | 6.07 | 164.67
Default Bucket with Early Release Optimization | 2034.00 (+2.63%) | 1674.49 (+2.43%) | 6.09 (-0.20%) | 164.34 (-0.20%)

#### LLama3.2-1B

Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec
-- | -- | -- | -- | --
Default Bucket | 1736.03 | 1424.64 | 3.37 | 296.53
Default Bucket with Early Release Optimization | 1659.78 (+4.39%) | 1366.78 (+4.06%) | 3.41 (-1.09%) | 293.34 (-1.08%)
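
The effect described under Motivation and Context can be illustrated with a toy model. The sketch below is not the onnxruntime implementation: `ToyBucketCache`, `AcquireBuffer`, and `BuffersAllocated` are hypothetical names, buffers are plain integer ids, and GPU synchronization is ignored entirely. It only contrasts returning a buffer to the free list at release time with deferring that step to a refresh that runs once per batch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Toy stand-in for a bucketed buffer cache. Hypothetical; NOT onnxruntime code.
class ToyBucketCache {
 public:
  explicit ToyBucketCache(bool early_release) : early_release_(early_release) {}

  // Returns a buffer id for the bucket, reusing a cached buffer when one is free.
  int AcquireBuffer(std::size_t bucket_size) {
    std::vector<int>& free_list = free_[bucket_size];
    if (!free_list.empty()) {
      int id = free_list.back();
      free_list.pop_back();
      return id;
    }
    ++allocated_;  // No cached buffer available: a new allocation is needed.
    return next_id_++;
  }

  // Early mode: the buffer becomes reusable immediately.
  // Deferred mode: the buffer waits in a pending list until OnRefresh.
  void ReleaseBuffer(int id, std::size_t bucket_size) {
    if (early_release_) {
      free_[bucket_size].push_back(id);
    } else {
      pending_.emplace_back(id, bucket_size);
    }
  }

  // Called once per batch: moves pending buffers to the free lists.
  void OnRefresh() {
    for (const auto& entry : pending_) {
      free_[entry.second].push_back(entry.first);
    }
    pending_.clear();
  }

  std::size_t allocated() const { return allocated_; }

 private:
  bool early_release_;
  int next_id_ = 0;
  std::size_t allocated_ = 0;
  std::map<std::size_t, std::vector<int>> free_;
  std::vector<std::pair<int, std::size_t>> pending_;
};

// Simulates one batch in which every run needs a single buffer of the same
// bucket size, and returns how many distinct buffers had to be allocated.
std::size_t BuffersAllocated(bool early_release, int runs_per_batch) {
  assert(runs_per_batch >= 0);
  ToyBucketCache cache(early_release);
  const std::size_t kBucketSize = 256;  // Arbitrary size chosen for illustration.
  for (int run = 0; run < runs_per_batch; ++run) {
    int id = cache.AcquireBuffer(kBucketSize);
    cache.ReleaseBuffer(id, kBucketSize);
  }
  cache.OnRefresh();
  return cache.allocated();
}
```

In this toy model, `BuffersAllocated(true, 16)` returns 1, because the single buffer is reused on every run, while `BuffersAllocated(false, 16)` returns 16, because nothing becomes reusable until the refresh at the end of the batch. Real workloads mix many bucket sizes and buffer lifetimes, so the savings reported in the tables above are much smaller than this extreme case.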