onnxruntime
69cfcba3 - [CUDA] Sparse Attention support 128k sequence length (#20614)

Commit
1 year ago
[CUDA] Sparse Attention support 128k sequence length (#20614)

### Description

When the sequence length is 128K, block_mask has 2048 rows, which the previous kernel does not support.

(1) Add a new kernel that handles more than 1024 rows, with each thread handling two rows.
(2) Add a test for sequence length 128K.
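The row arithmetic behind this change can be sketched in plain Python. This is an illustration, not the kernel from the commit. The sparse block size of 64 is inferred from 128K / 2048. The limit of 1024 threads per CUDA block is the standard hardware limit and would explain why the earlier kernel stopped at 1024 rows. The strided row assignment (`thread_id` and `thread_id + 1024`) and the helper name `rows_for_thread` are assumptions; the actual kernel may assign rows differently.

```python
MAX_THREADS_PER_BLOCK = 1024  # CUDA limit on threads in one thread block
SEQUENCE_LENGTH = 128 * 1024  # 128K tokens
SPARSE_BLOCK_SIZE = 64        # assumption: inferred from 128K / 2048 rows

# One block_mask row per sparse block along the sequence.
NUM_ROWS = SEQUENCE_LENGTH // SPARSE_BLOCK_SIZE


def rows_for_thread(thread_id, num_rows, rows_per_thread=2):
    """Return the block_mask rows one thread would process.

    Hypothetical mapping: the thread strides by the thread-block size,
    so thread t covers rows t and t + 1024 when there are 2048 rows.
    """
    candidates = (
        thread_id + k * MAX_THREADS_PER_BLOCK for k in range(rows_per_thread)
    )
    return [row for row in candidates if row < num_rows]


def covered_rows(num_rows, rows_per_thread):
    """Collect every row touched by a single full thread block."""
    rows = []
    for thread_id in range(MAX_THREADS_PER_BLOCK):
        rows.extend(rows_for_thread(thread_id, num_rows, rows_per_thread))
    return sorted(rows)


if __name__ == "__main__":
    one_row = covered_rows(NUM_ROWS, rows_per_thread=1)
    two_rows = covered_rows(NUM_ROWS, rows_per_thread=2)
    print(f"block_mask rows: {NUM_ROWS}")
    print(f"rows covered with one row per thread: {len(one_row)}")
    print(f"rows covered with two rows per thread: {len(two_rows)}")
```

With one row per thread, a single thread block reaches only the first 1024 rows; with two rows per thread it reaches all 2048, which matches the 128K case described above.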