onnxruntime
d1b85f5f - Reduce LLaMA memory usage (#18181)

Reduce LLaMA memory usage (#18181)

### Description
This PR reduces memory usage when exporting and benchmarking LLaMA.

### Motivation and Context
- Exporting: the PyTorch model is deleted from memory immediately after a successful export, instead of being kept alive through both the export and the conversion of the ONNX model to the desired precision.
- Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs reuse the same GPU memory for both the prompt and token-generation benchmarks.
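The exporting change boils down to releasing the PyTorch model as soon as the ONNX export succeeds, so its weights are not held in memory during the precision-conversion step. A minimal sketch of that pattern, using a hypothetical `FakeModel` and `export_model` stand-in (real code would use `torch.nn.Module` and `torch.onnx.export`):

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large PyTorch model."""
    pass


def export_model(model):
    """Hypothetical stand-in for torch.onnx.export; returns the output path."""
    return "model.onnx"


model = FakeModel()
ref = weakref.ref(model)  # lets us observe when the model is collected

onnx_path = export_model(model)

# Free the model immediately after a successful export, instead of
# keeping it alive while the ONNX model is converted to another precision.
del model
gc.collect()

assert ref() is None  # the model's memory has been reclaimed
```

With a real multi-gigabyte LLaMA checkpoint, this ordering means peak memory is roughly max(model, converted ONNX graph) rather than their sum.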