Deepseek v4 csa mask collapse (#45928)

Commit

6 days ago

Deepseek v4 csa mask collapse (#45928)

[deepseek_v4] collapse CSA block_bias from [S, S*top_k] to [S, compressed_len]

Each query's top-k entries select the same K rows in both representations, so the softmax is math-identical, but the dense form has S*top_k columns that are mostly -inf, while the sparse form has compressed_len columns with top_k zeros per row.

Drops the index_select gather + the [B, 1, S, S, top_k] 5D allocation, and shrinks the attention KV axis from S*top_k to compressed_len (~2000× fewer rows at S=234, k=512 prefill).

Co-authored-by: Sawyer117 <Sawyer117@users.noreply.github.com>
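The equivalence claimed in the commit message can be checked with a small NumPy sketch. This is not the code from the PR: the function names, the toy sizes, and the single-head, unbatched layout are assumptions made for illustration, and it assumes each query's top-k indices are distinct. The first function builds the [S, S*top_k] block bias over gathered K/V rows. The second builds a [S, compressed_len] bias directly over the ungathered K/V.

```python
import numpy as np


def softmax(x):
    # Row-wise softmax; -inf entries become exactly 0 after exp().
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gathered_attention(q, k, v, topk_idx):
    """Pre-collapse layout (illustrative): KV axis of length S * top_k."""
    S, top_k = topk_idx.shape
    flat = topk_idx.reshape(-1)              # [S * top_k]
    k_g, v_g = k[flat], v[flat]              # gathered copies of the selected rows
    # Block bias: query i may only attend to its own block of top_k columns.
    bias = np.full((S, S * top_k), -np.inf)
    for i in range(S):
        bias[i, i * top_k:(i + 1) * top_k] = 0.0
    scores = q @ k_g.T / np.sqrt(q.shape[-1]) + bias
    return softmax(scores) @ v_g


def collapsed_attention(q, k, v, topk_idx):
    """Collapsed layout (illustrative): KV axis of length compressed_len."""
    S = topk_idx.shape[0]
    bias = np.full((S, k.shape[0]), -np.inf)
    # top_k zeros per row, placed at the selected compressed positions.
    np.put_along_axis(bias, topk_idx, 0.0, axis=1)
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    return softmax(scores) @ v


# Toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
S, compressed_len, top_k, D = 5, 12, 3, 8
q = rng.standard_normal((S, D))
k = rng.standard_normal((compressed_len, D))
v = rng.standard_normal((compressed_len, D))
# Distinct indices per query, as a top-k selection would produce.
topk_idx = np.stack(
    [rng.choice(compressed_len, size=top_k, replace=False) for _ in range(S)]
)

out_gathered = gathered_attention(q, k, v, topk_idx)
out_collapsed = collapsed_attention(q, k, v, topk_idx)
print(np.allclose(out_gathered, out_collapsed))
```

In this sketch the collapsed variant needs no gather and no S*top_k-wide buffer. The masked columns contribute exactly zero weight, so both paths sum over the same top_k rows for every query.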

References

#45928 - Deepseek v4 csa mask collapse

Author

ArthurZucker

Parents

9c8a1b83

transformers 2ad5a9b8 - Deepseek v4 csa mask collapse (#45928)
