[cpu] implement scaled dot product flash attention (#103826)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.
This PR adds a flash attention CPU kernel for the FP32 forward path. Blocking is applied along the query-length and kv-length dimensions, and the gemm + softmax update + gemm sequence is fused and computed at once for each block. Parallelization runs over the batch-size, head-number, and query-length dimensions. The causal attention mask is also supported: since attention to unseen tokens is masked out, early termination is applied and only the blocks in the lower-triangular part are computed.
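The sketch below illustrates the blocked scheme described above (per-block fused gemm + online softmax update + gemm, with causal early termination) in Python for a single batch/head slice. It is illustrative only, not the actual C++ kernel: the function name and block sizes are made up here, and the real implementation additionally parallelizes over batch, head, and query blocks.

```python
import torch

def blocked_causal_attention(q, k, v, q_block=128, kv_block=128):
    # q, k, v: [seq_len, head_dim] for one (batch, head) pair.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    for qs in range(0, seq_len, q_block):
        qe = min(qs + q_block, seq_len)
        qb = q[qs:qe]
        # Running softmax statistics for this query block.
        row_max = torch.full((qe - qs, 1), float("-inf"))
        row_sum = torch.zeros(qe - qs, 1)
        acc = torch.zeros(qe - qs, head_dim)
        for ks in range(0, seq_len, kv_block):
            # Early termination: with a causal mask, kv blocks that start
            # past the last query index of this block are fully masked.
            if ks > qe - 1:
                break
            ke = min(ks + kv_block, seq_len)
            # First gemm: attention scores for this (q block, kv block) pair.
            s = (qb @ k[ks:ke].T) * scale
            # Apply the causal mask inside blocks straddling the diagonal.
            q_idx = torch.arange(qs, qe).unsqueeze(1)
            k_idx = torch.arange(ks, ke).unsqueeze(0)
            s = s.masked_fill(k_idx > q_idx, float("-inf"))
            # Online softmax update of the running max and sum.
            new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            p = torch.exp(s - new_max)
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            # Second gemm: accumulate attention-weighted values.
            acc = acc * correction + p @ v[ks:ke]
            row_max = new_max
        out[qs:qe] = acc / row_sum
    return out
```

From the user side, this kernel is reached through `torch.nn.functional.scaled_dot_product_attention` on CPU FP32 inputs (with `is_causal=True` for the masked path), subject to the usual SDPA dispatch conditions.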
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584