Utilize ext data location to reduce qd matmul memory usage (#21451)
### Description
When a graph is quantized to QDQ format, the DQ + MatMul pattern is transformed to MatMulNBits by the level 2 optimizer when the model is initialized in an inference session.
In the transformation step, tensors are transposed and new tensor protos are created. Instead of using protobuf arena-allocated memory, this PR sets the tensor proto to use an external data location and points that location at the CPU-allocated buffer that already holds the tensor data.
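
A minimal sketch of the idea, using the ONNX protobuf API. The field names come from the ONNX `TensorProto` schema; the `"<in-memory>"` sentinel and the helper name are illustrative assumptions, not the exact tag or function used in this PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include "onnx/onnx_pb.h"

// Point a TensorProto at a CPU buffer we already own, instead of copying
// the payload into protobuf arena memory.
void SetTensorProtoToInMemoryBuffer(onnx::TensorProto& proto,
                                    const void* buffer, size_t nbytes) {
  proto.clear_raw_data();  // drop any arena-held copy of the payload
  proto.set_data_location(onnx::TensorProto_DataLocation_EXTERNAL);

  auto* location = proto.add_external_data();
  location->set_key("location");
  location->set_value("<in-memory>");  // assumed sentinel recognized by the loader

  auto* offset = proto.add_external_data();
  offset->set_key("offset");
  offset->set_value(std::to_string(reinterpret_cast<uintptr_t>(buffer)));

  auto* length = proto.add_external_data();
  length->set_key("length");
  length->set_value(std::to_string(nbytes));
}
```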
Then, in the step that creates an OrtValue from the tensor proto, the buffer referenced by the tensor proto is assigned directly to the tensor, which previously would have been allocated from the ORT arena.
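
The same zero-copy pattern can be illustrated with the public C++ API; the optimizer itself works on internal types, so this is a sketch of the concept rather than the PR's code path:

```cpp
#include <cstdint>
#include <vector>
#include "onnxruntime_cxx_api.h"

// Wrap an existing CPU buffer in an OrtValue without copying. The OrtValue
// takes a view over `buffer`; the caller keeps ownership, so no second copy
// is made and no extra arena allocation occurs.
Ort::Value WrapBufferAsTensor(float* buffer, size_t element_count,
                              const std::vector<int64_t>& shape) {
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  return Ort::Value::CreateTensor<float>(mem_info, buffer, element_count,
                                         shape.data(), shape.size());
}
```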
With these two steps, the peak memory usage of the QDQ format model matches that of the QOperator model, and the model initialization time is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
as an example:
| | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| Peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| Initialization time | 3 sec | 9 sec | 5 sec |
### Motivation and Context
When a graph is quantized to QDQ format, the DQ + MatMul pattern is converted to MatMulNBits by the level 2 optimizer.
Originally, the newly created tensor protos used memory allocated by the protobuf arena, which cannot be fully released when the tensor protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using the ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are created, and the tensors held by the ORT arena are likewise not fully released.
These two arena allocation steps in the DQ + MatMul -> MatMulNBits transformation result in almost 2x memory consumption during model initialization.