Add Unified Sequence Parallel attention (#12693)
* initial scheme of unified-sp
* initial all_to_all_double
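The all-to-all dimension exchange at the core of Ulysses-style sequence parallelism can be sketched without any framework. The helper name below is illustrative, not the PR's actual API; it simulates every rank at once using nested lists:

```python
# Framework-free sketch of the Ulysses-style all-to-all "dimension exchange":
# shards move from being split over the sequence dim to being split over the
# head dim. `simulate_all_to_all_dim_exchange` is a hypothetical name for
# illustration, not a diffusers helper.

def simulate_all_to_all_dim_exchange(shards, world_size):
    """shards[r] is rank r's tensor as a nested list [seq_local][heads].
    Returns the post-exchange shards: [seq_total][heads_local] per rank."""
    out = []
    for dst in range(world_size):
        rows = []
        for src in range(world_size):      # gather every rank's sequence chunk
            for row in shards[src]:        # each local sequence position
                heads = len(row) // world_size
                # keep only dst's slice of the head dimension
                rows.append(row[dst * heads:(dst + 1) * heads])
        out.append(rows)
    return out
```

With `world_size=2`, each rank goes from holding half the sequence with all heads to holding the full sequence with half the heads, which is the layout each rank needs to run ordinary attention locally on its head slice.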
* bug fixes, added comments
* unified attention prototype done
* remove the ValueError raised in `ContextParallelConfig` to enable unified attention
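With that restriction lifted, unified attention should be reachable by setting both degrees in the config. A minimal sketch, assuming `ContextParallelConfig(ring_degree=..., ulysses_degree=...)` and a transformer exposing `enable_parallelism()`, launched under `torchrun` with `ring_degree * ulysses_degree` processes (this is a hedged config fragment, not verbatim from this PR):

```python
# Hedged sketch: assumes diffusers exposes ContextParallelConfig with
# ring_degree/ulysses_degree and that the transformer provides
# enable_parallelism(). Run under torchrun with ring_degree * ulysses_degree
# processes; `pipeline` is a DiffusionPipeline you have already loaded.
import torch.distributed as dist
from diffusers import ContextParallelConfig

dist.init_process_group(backend="nccl")

# Both degrees > 1 selects the unified (ring + Ulysses) attention path.
config = ContextParallelConfig(ring_degree=2, ulysses_degree=2)
pipeline.transformer.enable_parallelism(config=config)
```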
* bug fix
* feat: Adds Test for Unified SP Attention and Fixes a bug in Template Ring Attention
* bug fixes: LSE calculation and testing; switched to the `_all_to_all_single` helper in `_all_to_all_dim_exchange` due to contiguity issues
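The LSE (log-sum-exp) calculation mentioned above refers to the standard trick for exactly merging partial attention outputs computed over different key shards, as in ring attention. A scalar sketch (per query, per head; the symbols `out_i`/`lse_i` are generic, not this PR's variable names):

```python
import math

# Merging two partial attention results for one query: shard i saw a subset
# of the keys and reports its partial output out_i plus lse_i, the
# log-sum-exp of its attention scores. The exact global softmax output is
#   lse = log(exp(lse1) + exp(lse2))
#   out = exp(lse1 - lse) * out1 + exp(lse2 - lse) * out2

def merge_partials(out1, lse1, out2, lse2):
    m = max(lse1, lse2)                  # subtract the max for stability
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    w1 = math.exp(lse1 - lse)            # renormalization weight of shard 1
    w2 = math.exp(lse2 - lse)            # renormalization weight of shard 2
    return w1 * out1 + w2 * out2, lse
```

Because the weights renormalize each shard's softmax by the global partition function, the merge is exact, not an approximation; this is why a bug in the LSE bookkeeping silently corrupts attention outputs.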
* addressing comments
* sequence parallelism bug fixes
* code format fixes
* Apply style fixes
* code formatting fix
* added unified attention docs and removed test file
* Apply style fixes
* tip for unified attention in docs at distributed_inference.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update distributed_inference.md, adding benchmarks
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update docs/source/en/training/distributed_inference.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* function name fix
* fixed benchmark in docs
---------
Co-authored-by: KarthikSundar2002 <karthiksundar30092002@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>