onnxruntime
d1b85f5f - Reduce LLaMA memory usage (#18181)

Reduce LLaMA memory usage (#18181)

### Description
This PR reduces memory usage when exporting and benchmarking LLaMA.

### Motivation and Context
- Exporting: the PyTorch model is deleted from memory immediately after a successful export, instead of being kept alive through both the export and the conversion of the ONNX model to the desired precision.
- Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs reuse the same GPU memory for both the prompt and token-generation benchmarks.
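The exporting change boils down to releasing the PyTorch model as soon as the ONNX export succeeds, so its weights are not held in memory during the precision-conversion step. A minimal sketch of that pattern, using a hypothetical `FakeModel` and `export_model` stand-in (real code would use `torch.nn.Module` and `torch.onnx.export`):

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large PyTorch model."""
    pass


def export_model(model):
    """Hypothetical stand-in for torch.onnx.export; returns the output path."""
    return "model.onnx"


model = FakeModel()
ref = weakref.ref(model)  # lets us observe when the model is collected

onnx_path = export_model(model)

# Free the model immediately after a successful export, instead of
# keeping it alive while the ONNX model is converted to another precision.
del model
gc.collect()

assert ref() is None  # the model's memory has been reclaimed
```

With a real multi-gigabyte LLaMA checkpoint, this ordering means peak memory is roughly max(model, converted ONNX graph) rather than their sum.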