onnxruntime
f610605a - [DML EP] Support partial rotary embedding (#22417)

Commit

1 year ago

[DML EP] Support partial rotary embedding (#22417)

### Description

This adds support for partial RotaryEmbedding to DML. Partial RotaryEmbedding runs the rotary embedding calculation on a subregion of the input tensor, as if its head size were `rotary_embedding_dim`, while leaving the second part of the tensor (i.e. the remaining `head_size - rotary_embedding_dim` elements) alone.

To achieve this, we follow these steps:

1. Split the tensor into 2 parts
2. Run the rotary embedding algorithm on the first part, just like we were doing before on the entire tensor
3. Join the 2 parts back together

Since we're leaving the middle part (step 2) intact, the RotaryEmbedding fusion will still be done within DML. Also, the concat at the end is essentially free because DML optimizes it out and directly allocates the result of RotaryEmbedding at the right place. The only overhead here is the splitting of the tensor at the beginning, which we should eventually make part of the RotaryEmbedding fusion within DML.

### Motivation and Context

This fix allows us to correctly run models that have a `partial_rotary_factor` setting in huggingface, including Nvidia's Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct
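To illustrate the split / rotate / join steps from the description, here is a minimal pure-Python sketch for a single head vector. It is not the DML EP code: the function names `rotary_embedding` and `partial_rotary_embedding` are hypothetical, and the non-interleaved ("rotate-half") pairing and the frequency base of 10000 are assumptions chosen for the example.

```python
import math


def rotary_embedding(x, position, base=10000.0):
    """Rotate-half rotary embedding on one vector of even length.

    Assumption: element i is paired with element i + len(x) // 2
    (non-interleaved layout), and frequencies use the given base.
    """
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("rotary dimension must be even")
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def partial_rotary_embedding(x, position, rotary_embedding_dim, base=10000.0):
    """Apply rotary embedding to the first rotary_embedding_dim elements only."""
    if not 0 <= rotary_embedding_dim <= len(x):
        raise ValueError("rotary_embedding_dim must be within the head size")
    # 1. Split the vector into the rotated part and the pass-through part.
    rotary_part = x[:rotary_embedding_dim]
    passthrough_part = x[rotary_embedding_dim:]
    # 2. Run the regular rotary embedding as if the head size
    #    were rotary_embedding_dim.
    rotated = rotary_embedding(rotary_part, position, base)
    # 3. Join the two parts back together.
    return rotated + passthrough_part


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # head_size = 6
result = partial_rotary_embedding(head, position=3, rotary_embedding_dim=4)
```

In this sketch the last `head_size - rotary_embedding_dim` elements of `result` are identical to the input, and setting `rotary_embedding_dim` to the full head size reduces to the ordinary rotary embedding.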

References

#22417 - [DML EP] Support partial rotary embedding

Author

PatriceVignola

Parents

a164228c
