Long sequence parallelism (Ulysses) integration with HuggingFace (#5774)
This PR extends [DeepSpeed long sequence (context)
parallelism (aka DS
Ulysses)](https://dl.acm.org/doi/10.1145/3662158.3662806) with support
for HuggingFace (and, by extension, other frameworks') models. With the
HF integration, users can apply sequence parallelism to model
pre-, mid-, and post-training, fine-tuning, etc. Usage requires both _torch
>= 2.2.2 and flash-attention_. ZeRO-1 and ZeRO-2 are supported; ZeRO-3 and
SDPA support are in progress. The corresponding HF PR is
[PR32305](https://github.com/huggingface/transformers/pull/32305).
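For context, the core idea of DS Ulysses (per the linked paper) is an all-to-all exchange around attention: each rank starts with a shard of the *sequence* and all attention heads, and after the exchange holds the *full* sequence but only a shard of the heads, so attention can be computed locally. The following is a minimal single-process NumPy sketch of that resharding, not the DeepSpeed implementation; all names (`ulysses_all_to_all`, the shapes) are illustrative assumptions.

```python
import numpy as np

P = 4               # sequence-parallel degree (number of ranks), illustrative
S, H, D = 16, 8, 5  # full sequence length, attention heads, head dim
assert S % P == 0 and H % P == 0

# Each rank starts with its local sequence shard: (S/P, H, D)
shards = [np.random.rand(S // P, H, D) for _ in range(P)]

def ulysses_all_to_all(shards, P):
    """Simulate the Ulysses all-to-all: rank s sends head-block r of its
    sequence shard to rank r. Afterwards rank r holds the full sequence
    but only H/P heads, shape (S, H/P, D)."""
    hp = H // P
    out = []
    for r in range(P):  # receiving rank
        pieces = [shards[s][:, r * hp:(r + 1) * hp, :] for s in range(P)]
        out.append(np.concatenate(pieces, axis=0))  # concat over sequence
    return out

head_shards = ulysses_all_to_all(shards, P)
print(head_shards[0].shape)  # (16, 2, 5) -> full sequence, H/P heads
```

In the real system this exchange is a `torch.distributed` all-to-all before (and its inverse after) the local attention call, which is why the integration pairs with flash-attention for the per-rank attention computation.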
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>