DeepSpeed
Let allgather and alltoall execute in parallel when both attention and MoE use TP
#7723
Open
