pytorch: c9ba967c - Upstream xformers code (#100583)

Upstream xformers code (#100583)

# Summary
Since the initial upstream of memory-efficient attention from xformers (#86157), significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of the kernel as of xformers version 0.0.20, commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](https://github.com/facebookresearch/xformers/commit/6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3).

## Future
Although this version of the kernel supports dropout and arbitrary attention bias, I did not add that support to SDPA yet and left the guards in sdp_utils. Those will come in follow-up PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.

## Specific Changes

### Minor Changes
* The build-system work was done in the previous PR, so no changes were needed to CMake 🤞
* Added the new files and re-arranged/created the folder structure
* Updated include paths
* Switched from xformers-specific functions: `XFORMERS_CHECK` -> `TORCH_CHECK`
* Changed xformers-specific macros
* Updated `generate_kernels.py` to account for the PyTorch file structure, and added an argparse interface so the script could be run against a test directory before creating the files in place

### Bigger Changes
* Previous kernel changes removed the chunk optimization; see the discussion here: https://github.com/pytorch/pytorch/pull/96880
* Increased the number of CUDA kernels, potentially affecting the CUDA lib size
* Preemptively changed the dtypes of seed and offset in order to allow for CUDA graphs (#100196); this is not finished
* Made VERY BC-breaking changes to the `at::_efficient_attention_forward` and `at::_efficient_attention_backward` function signatures
  * I made these changes in part to enable https://github.com/pytorch/pytorch/pull/100196 to land

### Due Diligence Checks
* CUDA lib size:
  * Before: 496 MiB
  * After: 496 MiB
* Performance sweep:
  * I swept over 576 configs for forward-only inference; the geomean speedup was 0.98x, with a min speedup of 0.84x and a max speedup of 1.2x
  * For forward + backward, running on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02x and a max speedup of 1.35x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100583
Approved by: https://github.com/cpuhrsch
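For context, the upstreamed kernels back the memory-efficient backend of `torch.nn.functional.scaled_dot_product_attention`. The following is a minimal sketch, not part of this PR, showing how that backend can be exercised; the tensor shapes are illustrative, and the `torch.backends.cuda.sdp_kernel` context manager reflects the PyTorch 2.0-era API (later versions expose `torch.nn.attention.sdpa_kernel` instead):

```python
import torch
import torch.nn.functional as F

# Illustrative (batch, heads, seq_len, head_dim) inputs on a CUDA device;
# fp16 keeps us on a path the memory-efficient kernel supports.
assert torch.cuda.is_available(), "requires a CUDA device"
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Force the memory-efficient backend (disable flash and the math fallback)
# so the dispatched kernel is the one upstreamed from xformers.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

Note that, per the Future section above, dropout and arbitrary attention bias are supported by the kernel itself but are not yet routed through SDPA, so passing a non-null `attn_mask` or `dropout_p` here would still fall back to another backend.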