Fix DeepSpeed sequence parallelism for batch sizes larger than 1 (#5823)
Modified the `alltoall` function so that sequence parallelism works with batch sizes larger than 1.
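For context, a minimal pure-Python sketch of the kind of sequence-to-head all-to-all exchange used in sequence parallelism, with the batch dimension kept intact. This is not the DeepSpeed code from this change: the helpers `split_heads`, `simulated_all_to_all`, and `seq_to_head_parallel` are hypothetical, the `[sequence][batch][heads]` layout is an assumption, and the collective is simulated in-process instead of calling a real communication backend.

```python
def split_heads(shard, world):
    """Split a [s_local][batch][heads] nested list into `world` chunks along heads."""
    heads = len(shard[0][0])
    assert heads % world == 0, "head count must divide evenly across ranks"
    per = heads // world
    return [
        [[row[dst * per:(dst + 1) * per] for row in step] for step in shard]
        for dst in range(world)
    ]


def simulated_all_to_all(send):
    """send[src][dst] is the chunk rank `src` sends to rank `dst`; returns recv[dst][src]."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


def seq_to_head_parallel(shards):
    """Turn sequence-sharded inputs into head-sharded outputs holding the full sequence."""
    world = len(shards)
    recv = simulated_all_to_all([split_heads(s, world) for s in shards])
    # Concatenate the received chunks along the sequence axis in source-rank order.
    return [[step for chunk in chunks for step in chunk] for chunks in recv]


# Illustrative sizes (assumed): batch size 2 exercises the batch > 1 case.
SEQ, BATCH, HEADS, WORLD = 4, 2, 4, 2
# Each element is labelled with its (sequence, batch, head) index.
full = [[[(s, b, h) for h in range(HEADS)] for b in range(BATCH)] for s in range(SEQ)]
s_local = SEQ // WORLD
shards = [full[r * s_local:(r + 1) * s_local] for r in range(WORLD)]
out = seq_to_head_parallel(shards)
```

After the exchange, each simulated rank holds every sequence position and every batch entry, but only its own slice of the heads.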
Verified the results with only `TP`.

---------
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: Logan Adams <loadams@microsoft.com>