Port `parallel_state.py` (`mpu`) from Megatron-DeepSpeed
Ulysses needs `mpu` outside of Megatron-DeepSpeed (Meg-DS), so this PR ports the `mpu` code.
Trimming this file down to just the SP (sequence parallel) groups appears non-trivial, because DeepSpeed calls into many other methods of this class whenever `mpu is not None`.
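
To illustrate why a trimmed file is risky, here is a minimal sketch with no DeepSpeed dependency. `SPOnlyMPU`, `describe_topology`, and `missing_accessor` are hypothetical names invented for this example; they are not DeepSpeed code. The sketch only assumes that a caller, once handed a non-`None` `mpu`, may query accessors beyond the SP ones.

```python
class SPOnlyMPU:
    """Hypothetical trimmed-down mpu exposing only sequence-parallel accessors."""

    def __init__(self, sp_world_size, sp_rank):
        self._sp_world_size = sp_world_size
        self._sp_rank = sp_rank

    def get_sequence_parallel_world_size(self):
        return self._sp_world_size

    def get_sequence_parallel_rank(self):
        return self._sp_rank


def describe_topology(mpu=None):
    """Hypothetical stand-in for a caller that queries mpu when one is given."""
    if mpu is None:
        # Without an mpu the caller falls back to defaults.
        return {"sp": 1, "mp": 1}
    return {
        "sp": mpu.get_sequence_parallel_world_size(),
        # A non-SP accessor: absent from the trimmed class, so this raises.
        "mp": mpu.get_model_parallel_world_size(),
    }


def missing_accessor(mpu):
    """Return the AttributeError message from describe_topology, or None on success."""
    try:
        describe_topology(mpu)
    except AttributeError as err:
        return str(err)
    return None
```

Calling `missing_accessor(SPOnlyMPU(2, 0))` returns an error message naming the absent accessor, which is the failure mode that porting the full file avoids.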