Add option to memory-map .ORT model loads (#28164)
Addressing issue #25524 (MS internal: 60577894)
Today, the closest callers can get to loading a model from a shared
resource is to map the file themselves and pass the bytes with
use_ort_model_bytes_directly, which also makes the caller responsible
for keeping the mapping valid. These changes introduce
use_memory_mapped_ort_model, a session option that uses memory-mapped
I/O to load ORT format models directly inside ONNX Runtime, with the
mapping owned by the InferenceSession. The implementation is simple and
minimal, reusing ORT's existing platform-agnostic memory-mapping
helpers, and if we later make this the default behavior it could mean
automatic memory savings for multi-process usage.
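For illustration, here is a minimal sketch of opting in through the C++
API. The config-key string is the one this PR adds; the "1" value and
the model path are assumptions following the usual session-config
conventions.

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "mmap-demo"};
  Ort::SessionOptions so;
  // Opt in to memory-mapped loading of the .ort file. The InferenceSession
  // owns the mapping, so the caller does no mapping or lifetime management.
  // "1" as the enabling value is assumed from the usual boolean-config style.
  so.AddConfigEntry("session.use_memory_mapped_ort_model", "1");
  Ort::Session session{env, ORT_TSTR("resnet50.ort"), so};
  return 0;
}
```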
### Note about memory implications & sharing model bytes
In reality, use_memory_mapped_ort_model alone offers no long-running
memory advantage, because ORT ultimately copies the model bytes from the
mapped pages into Tensors. Using it together with
_session.use_ort_model_bytes_for_initializers_ ensures that initializers
point directly at the flatbuffer bytes and avoids that extra copy; this
is the expected usage for multi-process sharing of a single model. That
raises questions about what the default behavior should be; the changes
in this PR are conservative and retain all existing defaults for now.
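As a sketch, the combined configuration described above might look like
this (both key strings appear in onnxruntime_session_options_config_keys.h
per this PR; the "1" values are assumed):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "shared-model"};
  Ort::SessionOptions so;
  // Map the .ort file instead of reading it into a private buffer.
  so.AddConfigEntry("session.use_memory_mapped_ort_model", "1");
  // Existing option: initializers reference the model bytes in place rather
  // than being copied into Tensors, so the mapped pages can remain shared
  // across processes loading the same file.
  so.AddConfigEntry("session.use_ort_model_bytes_for_initializers", "1");
  Ort::Session session{env, ORT_TSTR("resnet50.ort"), so};
  return 0;
}
```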
**Changes**
- **onnxruntime_session_options_config_keys.h** — New
session.use_memory_mapped_ort_model config key
- **inference_session.h** — Added Env::MappedMemoryPtr member to hold
the file mapping; updated existing comments to document the mmap path
- **inference_session.cc** — New LoadOrtModelBytesMapped() static
function (see the sketch after this list); updated LoadOrtModel(PathString)
to check the config and use mmap; updated Initialize() cleanup to release
the mapping; updated the comment on initializer gating to note the mmap case
- **ort_model_only_test.cc** — Two new tests:
LoadOrtFormatModelMemoryMapped and
LoadOrtFormatModelMemoryMappedWithInitializersFromMap
- Also checking in a benchmarking tool, benchmark_mmap_ort.py, purely
for preservation; this is optional and can be omitted.
- Added a flag to the perf test used by the benchmark that holds the
session open for a specified amount of time, which is useful for
measuring memory-sharing changes. We can revert this and exclude the
benchmark if they are not desired for check-in.
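For reference, here is a rough sketch of the mapped load path on top of
the existing platform helpers (Env::GetFileLength and
Env::MapFileIntoMemory are ORT's existing APIs; the function signature
below is illustrative, not the exact PR code):

```cpp
#include "core/common/common.h"        // ORT_RETURN_IF_ERROR, Status
#include "core/common/path_string.h"   // PathString
#include "core/platform/env.h"         // Env, Env::MappedMemoryPtr

namespace onnxruntime {
// Hypothetical shape of the new helper: map the whole .ort file and hand the
// owning pointer to the InferenceSession, which releases it after
// Initialize() (or keeps it alive when initializers point into the mapping).
static common::Status LoadOrtModelBytesMapped(const PathString& model_path,
                                              Env::MappedMemoryPtr& mapped_memory,
                                              size_t& num_bytes) {
  num_bytes = 0;
  ORT_RETURN_IF_ERROR(Env::Default().GetFileLength(model_path.c_str(), num_bytes));
  // Existing platform-agnostic helper: mmap on POSIX, file mapping on Windows.
  // The unique_ptr deleter unmaps the pages when the session releases it.
  ORT_RETURN_IF_ERROR(Env::Default().MapFileIntoMemory(model_path.c_str(),
                                                       /*offset=*/0, num_bytes,
                                                       mapped_memory));
  return common::Status::OK();
}
}  // namespace onnxruntime
```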
### **Benchmark Examples**
Note that the benchmark was largely written by GHCP and may not be
perfect, but I've validated some of its results.
**Single-Proc**
Here is a sample result from a single-process benchmark using resnet50
(converted to ORT format). Note that these numbers are peaks during
session construction rather than end states, and the measurements may be
imperfect.
```
python tools/python/benchmark_mmap_ort.py --perf-test build\Windows\Release\Release\onnxruntime_perf_test.exe --model resnet50.ort --iterations 15
```
| Configuration | Session Creation (ms) | Peak Private Commit (MB) | Peak Working Set (MB) | Session vs baseline | Private vs baseline |
|---|---|---|---|---|---|
| .ort standard load (baseline) | 193.13 | 222.9 | 235.9 | — | — |
| .ort memory-mapped load | 120.95 | 125.7 | 236.1 | **-37.4%** | **-43.6%** |
| .ort mmap + direct initializers | 14.87 | 109.6 | 120.6 | **-92.3%** | **-50.8%** |
**Multi-Proc**
The multi-process benchmark shows that total memory savings for shared
models are only realized when
_session.use_ort_model_bytes_for_initializers_ is also enabled:
| Configuration (4 processes) | Total Private (MB) | Total Working Set (MB) | Private vs baseline |
|---|---|---|---|
| .ort standard load (baseline) | 462.6 | 519.0 | — |
| .ort memory-mapped load | 462.1 | 518.5 | -0.1% |
| .ort mmap + direct initializers | 98.2 | 187.8 | **-78.8%** |
---------
Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>