pytorch
98a5cf09 - [SDPA] Remove the chunk_grad from mem-eff attention (#96880)

Commit

1 year ago

[SDPA] Remove the chunk_grad from mem-eff attention (#96880)

# Summary

There exists an optimization within the scaled_dot_product_efficient backward attention path that, under the right conditions, outputs grad_q, grad_k, and grad_v all as aliases of the same storage. This was done to optimize for the hot path where mha does packed linear_projection -> chunk -> (view stuff) -> sdpa. The thought was that chunk.backward() would be able to "trivially" cat its inputs. However, upon closer inspection, chunk.backward will call `cat` regardless of the inputs, so this optimization is not being utilized.

I validated this by profiling on main and then on this branch: the traces were the same, both showing `split.backward()` calling into `cat`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
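To make the hot path above concrete, here is a minimal sketch of the packed projection -> chunk -> backward pattern. It is an illustration under stated assumptions, not the code touched by this commit: the function name `packed_qkv_grad`, the tensor sizes, and the weighted sum standing in for the attention call are all hypothetical. It only shows that the three per-chunk gradients come back as a single gradient for the packed tensor; it does not inspect which kernels autograd runs to assemble it.

```python
import torch


def packed_qkv_grad(batch=2, seq=4, embed=8):
    """Return the gradient of a packed q/k/v tensor after chunk + backward.

    Hypothetical helper for illustration only; the weighted sum below is a
    stand-in for the attention computation, chosen so that the expected
    gradient of each chunk is a known constant.
    """
    # Stand-in for the output of a packed linear projection:
    # q, k, and v are laid out side by side along the last dimension.
    packed = torch.randn(batch, seq, 3 * embed, requires_grad=True)

    # chunk returns three views into the packed tensor.
    q, k, v = packed.chunk(3, dim=-1)

    # Stand-in for sdpa: gives q, k, v gradients of 1, 2, and 3 respectively.
    loss = q.sum() + 2.0 * k.sum() + 3.0 * v.sum()
    loss.backward()

    # Autograd has reassembled the three per-chunk gradients into one tensor
    # with the same shape as the packed input.
    return packed.grad


grad = packed_qkv_grad()
print(grad.shape)
```

Each third of `grad` along the last dimension holds the gradient for the corresponding chunk, which is the reassembly step the commit message says autograd performs with `cat` no matter how the incoming gradients are stored.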

Author

drisspg

Committer

pytorchmergebot

Parents

d4b8ed2b
