Add memory stats to profiling (#29058)
This pull request adds a new metric, `bytes_requested_in_use`, to
allocator statistics throughout the ONNX Runtime codebase. The metric
tracks the memory actually requested by user code, excluding internal
fragmentation and padding, and is reported alongside the existing memory
usage statistics. The changes span core framework allocators, CUDA
providers, plugin interfaces, and kernel execution profiling.
**Allocator statistics improvements:**
* Added a new field, `bytes_requested_in_use`, to the `AllocatorStats`
struct in multiple locations (core, CUDA, plugin, and test). It tracks
the number of bytes actually requested by user code, as distinct from
`bytes_in_use`, which may include internal padding. The field is
initialized, serialized, and included in string/key-value
representations.
[[1]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250L17-R18)
[[2]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250R35)
[[3]](diffhunk://#diff-6838c5ae83d3adbfc970dc7631571a2708561bbcae4672068116908d076c1250R46)
[[4]](diffhunk://#diff-6e54bae7eb279d4de726526f2446f28114357473cdb25013a9fd91c0aba0c890L54-R68)
[[5]](diffhunk://#diff-6e54bae7eb279d4de726526f2446f28114357473cdb25013a9fd91c0aba0c890R82)
[[6]](diffhunk://#diff-721c0b8ae59bb11c1bfdc2470a617d761602dddc73d8d0b38550123b76145cd9L18-R29)
* Updated arena allocator implementations in both the core and CUDA
providers to increment and decrement `bytes_requested_in_use`
appropriately during allocation, reservation, splitting, and freeing of
memory chunks.
[[1]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R286)
[[2]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R392)
[[3]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R483)
[[4]](diffhunk://#diff-1823f1652f5e340ce0680ac864fc6045e5a4b1fdb948fbe9d39580aeb431b053R641)
[[5]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R308)
[[6]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R486)
[[7]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R574)
[[8]](diffhunk://#diff-f9cc3e2e013a7c61bd104c801f99b384f2b7fd059748ad9c764e10127d225af1R646)
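To illustrate how the two counters can diverge, here is a minimal sketch of an arena that rounds each request up to a fixed granularity. It is not the code from this PR: `ToyAllocatorStats`, `ToyArena`, the integer handles, and the 256-byte granularity are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified stand-in for the real AllocatorStats struct.
struct ToyAllocatorStats {
  int64_t bytes_in_use = 0;            // includes rounding/padding
  int64_t bytes_requested_in_use = 0;  // what callers actually asked for
};

// Hypothetical toy arena: every request is rounded up to kAlignment bytes.
// It hands out integer handles instead of real memory to keep the sketch small.
class ToyArena {
 public:
  static constexpr size_t kAlignment = 256;  // assumed granularity

  int Alloc(size_t requested) {
    const size_t rounded =
        (requested + kAlignment - 1) / kAlignment * kAlignment;
    const int handle = next_handle_++;
    chunks_[handle] = Chunk{requested, rounded};
    // Both counters go up on allocation, but by different amounts.
    stats_.bytes_in_use += static_cast<int64_t>(rounded);
    stats_.bytes_requested_in_use += static_cast<int64_t>(requested);
    return handle;
  }

  void Free(int handle) {
    auto it = chunks_.find(handle);
    assert(it != chunks_.end() && "unknown handle");
    // Each counter goes down by exactly what was added for this chunk.
    stats_.bytes_in_use -= static_cast<int64_t>(it->second.rounded);
    stats_.bytes_requested_in_use -=
        static_cast<int64_t>(it->second.requested);
    chunks_.erase(it);
  }

  const ToyAllocatorStats& Stats() const { return stats_; }

 private:
  struct Chunk {
    size_t requested;
    size_t rounded;
  };

  std::unordered_map<int, Chunk> chunks_;
  ToyAllocatorStats stats_;
  int next_handle_ = 0;
};
```

In this sketch, requesting 100 and then 300 bytes leaves `bytes_requested_in_use` at 400 while `bytes_in_use` is 768 (256 + 512). The gap between the two is the padding overhead that the new metric is meant to expose.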
**CUDA and plugin support:**
* Modified CUDA mempool allocators and plugins to report
`bytes_requested_in_use` (equal to `bytes_in_use` since there is no
padding in mempool allocators), ensuring consistent reporting across all
allocator types.
[[1]](diffhunk://#diff-86883e03b8b4d52cb205510e2651db8d99cd0c75baf074de08d0bd8342bb8a59R214)
[[2]](diffhunk://#diff-cb345155e851bfe3c54f94aca9b156076e51b9774badd487d255de247649c794R364)
**Adapter and API changes:**
* Updated the allocator adapter logic to parse, propagate, and serialize
the new `RequestedInUse` field in key-value pairs, enabling plugins and
external allocators to participate in the enhanced memory profiling.
[[1]](diffhunk://#diff-d8a86badc40a2b14be792fcc33abb30a1076a8c0a60382ea012c850d2649e099R53-R54)
[[2]](diffhunk://#diff-d8a86badc40a2b14be792fcc33abb30a1076a8c0a60382ea012c850d2649e099R149)
[[3]](diffhunk://#diff-0b74f1cc41b5d8a45998962bbf35d1dfcf35dcde30de5754e36fd4c583b506e9R10-R11)
[[4]](diffhunk://#diff-0b74f1cc41b5d8a45998962bbf35d1dfcf35dcde30de5754e36fd4c583b506e9R60-L66)
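The key-value round trip described above might look roughly like the following sketch. Only the `RequestedInUse` key name comes from this description. The `InUse` key, the `ToyStats` struct, the `std::map` container, and both helper functions are assumptions for illustration and do not reproduce the adapter's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, trimmed-down stats struct used only in this sketch.
struct ToyStats {
  int64_t bytes_in_use = 0;
  int64_t bytes_requested_in_use = 0;
};

// Assumed representation of the key-value pairs exchanged with a plugin.
using KeyValuePairs = std::map<std::string, std::string>;

// Serialize the counters as decimal strings under their keys.
KeyValuePairs ToKeyValuePairs(const ToyStats& s) {
  assert(s.bytes_in_use >= 0 && s.bytes_requested_in_use >= 0);
  KeyValuePairs kvps;
  kvps["InUse"] = std::to_string(s.bytes_in_use);  // key name assumed
  kvps["RequestedInUse"] = std::to_string(s.bytes_requested_in_use);
  return kvps;
}

// Parse the counters back. A missing key leaves the field at 0, so an
// allocator that does not report RequestedInUse still parses cleanly.
ToyStats FromKeyValuePairs(const KeyValuePairs& kvps) {
  ToyStats s;
  auto read = [&kvps](const char* key, int64_t& out) {
    auto it = kvps.find(key);
    if (it != kvps.end()) out = std::stoll(it->second);
  };
  read("InUse", s.bytes_in_use);
  read("RequestedInUse", s.bytes_requested_in_use);
  return s;
}
```

Treating the key as optional is a design choice of this sketch: it lets external allocators that predate the new field keep working while newer ones opt in.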
**Kernel execution memory profiling:**
* Enhanced the `KernelScope` in the sequential executor to sample and
emit both `bytes_in_use` and `bytes_requested_in_use` before and after
kernel execution, providing more granular memory profiling in event
logs.
[[1]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R394-R406)
[[2]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R437-R452)
[[3]](diffhunk://#diff-ee1124ddd8fa0e41f83cd9f2f69fbae0cf747370dd724e5b1a1b2875abcb8f05R491-R497)
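The before/after sampling pattern can be sketched with a small RAII scope. This is an illustration under stated assumptions rather than the real `KernelScope`: `ToyKernelScope`, `MemoryEvent`, `ToyStats`, and the sampler callback are hypothetical names, and the real profiler's event format is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down stats struct used only in this sketch.
struct ToyStats {
  int64_t bytes_in_use = 0;
  int64_t bytes_requested_in_use = 0;
};

// Hypothetical event record: one entry per kernel invocation.
struct MemoryEvent {
  std::string kernel_name;
  ToyStats before;
  ToyStats after;
};

// Samples allocator stats on construction and again on destruction,
// then appends both snapshots to the event log.
class ToyKernelScope {
 public:
  ToyKernelScope(std::string name, std::function<ToyStats()> sample,
                 std::vector<MemoryEvent>& log)
      : name_(std::move(name)), sample_(std::move(sample)), log_(log) {
    assert(sample_ && "sampler must be callable");
    before_ = sample_();  // snapshot taken before the kernel runs
  }

  ~ToyKernelScope() {
    // Snapshot taken after the kernel has finished.
    log_.push_back(MemoryEvent{name_, before_, sample_()});
  }

  ToyKernelScope(const ToyKernelScope&) = delete;
  ToyKernelScope& operator=(const ToyKernelScope&) = delete;

 private:
  std::string name_;
  std::function<ToyStats()> sample_;
  std::vector<MemoryEvent>& log_;
  ToyStats before_;
};
```

Subtracting `before` from `after` for each counter gives the per-kernel change in both padded and requested memory, which is the kind of comparison the paired samples make possible.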
**Minor build system fix:**
* Added the `/bigobj` compiler flag for C++ targets in the CUDA provider
CMake file so that large translation units do not exceed MSVC's default
limit on the number of sections in an object file.