Add memory stats to profiling (#29058)

Commit

12 days ago

This pull request introduces enhanced memory profiling capabilities by adding a new metric, `bytes_requested_in_use`, to allocator statistics throughout the ONNX Runtime codebase. This metric tracks the memory actually requested by user code, excluding internal fragmentation and padding, and is now reported alongside existing memory usage statistics. The changes span core framework allocators, CUDA providers, plugin interfaces, and kernel execution profiling.

**Allocator statistics improvements:**

* Added a new field, `bytes_requested_in_use`, to the `AllocatorStats` struct in multiple locations (core, CUDA, plugin, and test), which tracks the number of bytes actually requested by user code, distinct from total bytes in use that may include internal padding. This field is now initialized, serialized, and included in string/key-value representations. [[1]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250L17-R18) [[2]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250R35) [[3]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250R46) [[4]](diffhunk://#diff-6e54bae7eb279d4de726526f2446f28114357473cdb25013a9fd91c0aba0c890L54-R68) [[5]](diffhunk://#diff-6e54bae7eb279d4de726526f2446f28114357473cdb25013a9fd91c0aba0c890R82) [[6]](diffhunk://#diff-721c0b8ae59bb11c1bfdc2470a617d761602dddc73d8d0b38550123b76145cd9L18-R29)
* Updated arena allocator implementations in both the core and CUDA providers to increment and decrement `bytes_requested_in_use` appropriately during allocation, reservation, splitting, and freeing of memory chunks. [[1]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R286) [[2]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R392) [[3]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R483) [[4]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R641) [[5]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R308) [[6]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R486) [[7]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R574) [[8]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R646)

**CUDA and plugin support:**

* Modified CUDA mempool allocators and plugins to report `bytes_requested_in_use` (equal to `bytes_in_use` since there is no padding in mempool allocators), ensuring consistent reporting across all allocator types. [[1]](diffhunk://#diff-86883e03b8b4d52cb205510e2651db8d99cd0c75baf074de08d0bd8342bb8a59R214) [[2]](diffhunk://#diff-cb345155e851bfe3c54f94aca9b156076e51b9774badd487d255de247649c794R364)

**Adapter and API changes:**

* Updated the allocator adapter logic to parse, propagate, and serialize the new `RequestedInUse` field in key-value pairs, enabling plugins and external allocators to participate in the enhanced memory profiling. [[1]](diffhunk://#diff-d8a86badc40a2b14be792fcc33abb30a1076a8c0a60382ea012c850d2649e099R53-R54) [[2]](diffhunk://#diff-d8a86badc40a2b14be792fcc33abb30a1076a8c0a60382ea012c850d2649e099R149) [[3]](diffhunk://#diff-0b74f1cc41b5d8a45998962bbf35d1dfcf35dcde30de5754e36fd4c583b506e9R10-R11) [[4]](diffhunk://#diff-0b74f1cc41b5d8a45998962bbf35d1dfcf35dcde30de5754e36fd4c583b506e9R60-L66)

**Kernel execution memory profiling:**

* Enhanced the `KernelScope` in the sequential executor to sample and emit both `bytes_in_use` and `bytes_requested_in_use` before and after kernel execution, providing more granular memory profiling in event logs. [[1]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R394-R406) [[2]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R437-R452) [[3]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R491-R497)

**Build system minor fix:**

* Added the `/bigobj` compiler flag for C++ targets in the CUDA provider CMake file to prevent object file size limitations on MSVC.
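
**Illustrative sketch (not part of the change):**

The distinction between `bytes_in_use` and `bytes_requested_in_use`, and the before/after sampling around a kernel, can be sketched with a toy allocator. This is a minimal sketch under stated assumptions, not the ONNX Runtime implementation: `ToyAllocatorStats`, `ToyArena`, `ToyKernelScope`, and the 256-byte rounding granularity are hypothetical names and values chosen for illustration. Only the two counter names are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for allocator statistics. Only the two field names
// come from the commit description; the struct itself is illustrative.
struct ToyAllocatorStats {
  int64_t bytes_in_use = 0;            // includes rounding/padding added by the arena
  int64_t bytes_requested_in_use = 0;  // exactly what callers asked for
};

// Toy arena that rounds each request up to a multiple of kAlignment.
// The 256-byte granularity is an assumption made for this sketch.
class ToyArena {
 public:
  static constexpr size_t kAlignment = 256;

  // Records an allocation and returns an opaque chunk id (no real memory is handed out).
  int Alloc(size_t requested) {
    const size_t rounded = (requested + kAlignment - 1) / kAlignment * kAlignment;
    const int id = next_id_++;
    chunks_[id] = Chunk{requested, rounded};
    stats_.bytes_in_use += static_cast<int64_t>(rounded);
    stats_.bytes_requested_in_use += static_cast<int64_t>(requested);
    return id;
  }

  // Releases a chunk and decrements both counters by the amounts recorded at Alloc time.
  void Free(int id) {
    auto it = chunks_.find(id);
    assert(it != chunks_.end() && "unknown chunk id");
    stats_.bytes_in_use -= static_cast<int64_t>(it->second.rounded);
    stats_.bytes_requested_in_use -= static_cast<int64_t>(it->second.requested);
    chunks_.erase(it);
  }

  const ToyAllocatorStats& Stats() const { return stats_; }

 private:
  struct Chunk {
    size_t requested;
    size_t rounded;
  };
  std::unordered_map<int, Chunk> chunks_;
  ToyAllocatorStats stats_;
  int next_id_ = 0;
};

// RAII helper that samples the stats on entry and writes the change in both
// counters to delta_out on exit, mimicking "before and after kernel execution".
class ToyKernelScope {
 public:
  ToyKernelScope(const ToyArena& arena, ToyAllocatorStats& delta_out)
      : arena_(arena), before_(arena.Stats()), delta_(delta_out) {}

  ~ToyKernelScope() {
    const ToyAllocatorStats& after = arena_.Stats();
    delta_.bytes_in_use = after.bytes_in_use - before_.bytes_in_use;
    delta_.bytes_requested_in_use =
        after.bytes_requested_in_use - before_.bytes_requested_in_use;
  }

 private:
  const ToyArena& arena_;
  ToyAllocatorStats before_;  // copy taken at scope entry
  ToyAllocatorStats& delta_;
};
```

In this sketch, a 100-byte allocation made while a `ToyKernelScope` is alive produces a delta of 256 for `bytes_in_use` and 100 for `bytes_requested_in_use`; the 156-byte gap is the padding that the new metric is meant to separate out. An allocator without rounding, such as the mempool case described above, would report the same value for both counters.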

References

#29058 - Add memory stats to profiling

Author

yuslepukhin

Parents

48d2caca

onnxruntime bf2fcd1f - Add memory stats to profiling (#29058)
