onnxruntime
30c5f059 - Add Paged Attention Op for CUDA SM80 support (#24595)

Commit

200 days ago

Add Paged Attention Op for CUDA SM80 support (#24595)

### Description
Adds a Paged Attention op, which enables the use of a paged KV cache. Inputs to this op are unpadded (packed / varlen), so cumulative sequence lengths are a required input.

### Motivation and Context
Adding this op to ONNXRuntime is necessary to allow the GenAI team to enable a continuous batching server API.
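To illustrate why a packed (unpadded) input needs cumulative sequence lengths, here is a minimal pure-Python sketch. It is not the op's implementation or its actual input signature; the helper names `cumulative_seqlens` and `split_packed` and the sample data are hypothetical, chosen only to show the general idea that sequence boundaries in a packed buffer are recovered from a running sum of lengths.

```python
from itertools import accumulate


def cumulative_seqlens(seqlens):
    """Return [0, l0, l0 + l1, ...] for a batch of per-sequence lengths."""
    return [0, *accumulate(seqlens)]


def split_packed(tokens, cu_seqlens):
    """Slice a packed (unpadded) token list back into per-sequence lists.

    Sequence i occupies tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    return [tokens[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


# Hypothetical batch: three sequences of lengths 3, 1 and 2, stored back to
# back with no padding between them.
seqlens = [3, 1, 2]
packed = ["a0", "a1", "a2", "b0", "c0", "c1"]

cu = cumulative_seqlens(seqlens)
sequences = split_packed(packed, cu)
```

Without padding there is no fixed stride between sequences, so the boundary offsets have to be supplied explicitly, which is what the cumulative lengths provide.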

References

#24595 - Add Paged Attention Op for CUDA SM80 support

Author

aciddelgado

Parents

2b435364
