onnxruntime
01dd991f - Update SparseAttention op spec to make it more flexible (#20625)

1 year ago
### Description

Make the operator more flexible:

1. Decouple the max sequence lengths of the rotary cache, kv cache, and block mask. They are allowed to have different values.
2. Replace the dense block_mask with a CSR format (block_row_indices and block_col_indices) to improve performance.
3. Mark past_key and past_value as required inputs, since they are needed to compute the shapes of present_key and present_value.

### Motivation and Context

1. LongRoPE has short and long rotary caches, which have different lengths.
2. Most users do not have enough GPU memory to run the maximum sequence length of 128K. This change allows users to choose a smaller kv cache length for testing without running out of memory.
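As a rough illustration of the CSR change, the sketch below converts a dense 0/1 block mask into row pointers and column indices in the style of the block_row_indices and block_col_indices inputs. The helper name and shapes here are hypothetical, not part of the actual op spec:

```python
import numpy as np

def dense_block_mask_to_csr(block_mask: np.ndarray):
    """Convert a dense (num_blocks x num_blocks) 0/1 block mask to CSR:
    row pointers (length num_blocks + 1) and column indices of the
    nonzero blocks. Illustrative only; the real op defines the exact
    dtypes and layout in its spec."""
    num_rows = block_mask.shape[0]
    block_row_indices = np.zeros(num_rows + 1, dtype=np.int32)
    cols = []
    for r in range(num_rows):
        nz = np.nonzero(block_mask[r])[0]
        cols.extend(nz.tolist())
        # Each row pointer is the running count of nonzero blocks so far.
        block_row_indices[r + 1] = block_row_indices[r] + len(nz)
    return block_row_indices, np.array(cols, dtype=np.int32)

# Example: a causal (lower-triangular) mask over 3 blocks.
mask = np.tril(np.ones((3, 3), dtype=np.int32))
rows, cols = dense_block_mask_to_csr(mask)
# rows -> [0, 1, 3, 6]; cols -> [0, 0, 1, 0, 1, 2]
```

The CSR form stores only the active blocks, so the kernel can skip empty blocks entirely instead of scanning a dense mask.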