DeepSeek v4 CSA mask collapse (#45928)
[deepseek_v4] collapse CSA block_bias from [S, S*top_k] to [S, compressed_len]
Each query's top-k entries select the same K rows in both representations, so
the softmax is mathematically identical. The old dense form has S*top_k columns
that are mostly -inf, while the new sparse form has compressed_len columns with
top_k zeros per row. This drops the index_select gather and the
[B, 1, S, S, top_k] 5D allocation, and shrinks the attention KV axis from
S*top_k to compressed_len (~2000× fewer rows at S=234, top_k=512 prefill).
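
A toy illustration of the equivalence claimed above, in plain Python rather
than the actual model code. It assumes a single head, no batch dimension, no
causal constraint, and distinct top-k indices per query; every function and
variable name here is hypothetical and not taken from the repository.

```python
import math
import random

NEG_INF = float("-inf")

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, bias_row):
    # One query: scores = q.k + bias, then a softmax-weighted sum of values.
    scores = [sum(a * b for a, b in zip(q, k)) + bias
              for k, bias in zip(keys, bias_row)]
    probs = softmax(scores)
    return [sum(p * v[j] for p, v in zip(probs, values))
            for j in range(len(values[0]))]

def gathered_attention(Q, K, V, topk_idx):
    # Old form: gather S*top_k KV rows, bias is [S, S*top_k].
    S, top_k = len(Q), len(topk_idx[0])
    flat = [i for row in topk_idx for i in row]
    keys = [K[i] for i in flat]
    values = [V[i] for i in flat]
    out = []
    for s in range(S):
        # Only the query's own block of top_k columns is unmasked.
        bias = [0.0 if s * top_k <= c < (s + 1) * top_k else NEG_INF
                for c in range(S * top_k)]
        out.append(attend(Q[s], keys, values, bias))
    return out

def collapsed_attention(Q, K, V, topk_idx):
    # New form: no gather, bias is [S, compressed_len] with top_k zeros per row.
    out = []
    for s, q in enumerate(Q):
        selected = set(topk_idx[s])
        bias = [0.0 if c in selected else NEG_INF for c in range(len(K))]
        out.append(attend(q, K, V, bias))
    return out

def rand_rows(n, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
S, compressed_len, top_k, d = 5, 8, 3, 4   # toy sizes, assumed for the demo
Q = rand_rows(S, d, rng)
K = rand_rows(compressed_len, d, rng)
V = rand_rows(compressed_len, d, rng)
topk_idx = [rng.sample(range(compressed_len), top_k) for _ in range(S)]

old = gathered_attention(Q, K, V, topk_idx)
new = collapsed_attention(Q, K, V, topk_idx)
max_diff = max(abs(x - y) for ra, rb in zip(old, new) for x, y in zip(ra, rb))
```

In this sketch the two outputs differ only by floating-point summation order,
while the KV axis seen by the softmax shrinks from S*top_k to compressed_len.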
Co-authored-by: Sawyer117 <Sawyer117@users.noreply.github.com>