onnxruntime
69cfcba3 - [CUDA] Sparse Attention support 128k sequence length (#20614)

Commit

1 year ago

[CUDA] Sparse Attention support 128k sequence length (#20614)

### Description

When the sequence length is 128K, block_mask has 2048 rows, which the previous kernel does not support.

(1) Add a new kernel that handles more than 1024 rows; each thread needs to handle two rows.
(2) Add a test for sequence length 128K.
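The row arithmetic in the description can be checked with a short sketch. This is not the kernel from the commit: the function names are hypothetical, the sparse block size of 64 is only inferred from 131072 / 2048, and the strided row assignment is one possible way to give each thread two rows. The 1024 figure is the CUDA limit on threads per thread block.

```python
# Illustrative sketch only; names and the strided mapping are assumptions,
# not taken from the onnxruntime source.

MAX_THREADS_PER_BLOCK = 1024  # CUDA limit on threads in one thread block


def block_mask_rows(seq_len: int, sparse_block_size: int) -> int:
    """Number of block_mask rows: ceil(seq_len / sparse_block_size)."""
    return (seq_len + sparse_block_size - 1) // sparse_block_size


def rows_for_thread(thread_id: int, num_rows: int,
                    num_threads: int = MAX_THREADS_PER_BLOCK) -> list[int]:
    """Rows a hypothetical thread would process under a strided assignment."""
    return list(range(thread_id, num_rows, num_threads))


# 64 is inferred from the numbers in the description (131072 / 2048).
num_rows = block_mask_rows(128 * 1024, 64)

# With 2048 rows and at most 1024 threads, every thread gets two rows.
assignment = [rows_for_thread(t, num_rows) for t in range(MAX_THREADS_PER_BLOCK)]
```

Under these assumptions, a single thread block covers all 2048 rows only if each thread processes two of them, which matches the "each thread needs to handle two rows" note in the description.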

References

#20614 - [CUDA] Sparse Attention support 128k sequence length

Author

tianleiwu

Parents

a0db2187
