reduce all-to-all communication volume when both expert and non-expert are tensor-parallel (#5626)
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
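For reference, a minimal sketch (my own illustration, not the DeepSpeed code) of how these TP and EP rank groups can be derived from `world_size`, `model_degree`, and `expert_degree`:

```python
world_size = 8
model_degree = 2     # tensor (model) parallel degree
expert_degree = 4    # expert parallel degree
assert model_degree * expert_degree == world_size

# TP groups take consecutive ranks: [0, 1], [2, 3], [4, 5], [6, 7]
mp_group = [list(range(r, r + model_degree))
            for r in range(0, world_size, model_degree)]

# EP groups stride by the TP degree: [0, 2, 4, 6], [1, 3, 5, 7]
expert_parallel_group = [list(range(i, world_size, model_degree))
                         for i in range(model_degree)]
```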
Originally there was no drop operation before the experts ran: the two expert-parallel groups each performed a full all-to-all, and every rank ended up with the complete data, but ranks 0 and 1 received exactly the same data (likewise ranks 2 and 3, and so on). We can therefore drop the duplicated portion before the all-to-all and run an all-gather within the TP group afterwards to recover the complete data.
After the experts run, the data on ranks 0 and 1 is again identical, so we can drop it, perform the all-to-all, and then run an all-gather to recover the complete data.
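This equivalence can be checked with a small NumPy simulation (again my own illustration, not the DeepSpeed implementation): because TP peers start with identical data, dropping to a per-TP-rank shard before the all-to-all and all-gathering within the TP pair afterwards reproduces the result of the full-volume all-to-all while sending only half as much data through the all-to-all.

```python
import numpy as np

world_size, model_degree = 8, 2
ep_size = world_size // model_degree               # 4 ranks per EP group
ep_groups = [list(range(i, world_size, model_degree)) for i in range(model_degree)]
tokens, hidden = 8, 4                              # per-rank activation shape

def all_to_all(send, group):
    """send[r] is a list of per-destination chunks; each rank receives the
    chunk addressed to it from every peer in the group."""
    return {r: np.concatenate([send[p][gi] for p in group])
            for gi, r in enumerate(group)}

# After the tensor-parallel non-expert block, every TP pair (0,1), (2,3), ...
# holds identical activations.
rng = np.random.default_rng(0)
data = {}
for base in range(0, world_size, model_degree):
    x = rng.standard_normal((tokens, hidden))
    for r in range(base, base + model_degree):
        data[r] = x

# Original: full-volume all-to-all inside each EP group.
baseline = {}
for group in ep_groups:
    send = {r: np.array_split(data[r], ep_size, axis=0) for r in group}
    baseline.update(all_to_all(send, group))

# Optimized: drop to this TP rank's hidden shard, all-to-all on half the
# volume, then all-gather the shards inside the TP pair.
sharded = {}
for tp_idx, group in enumerate(ep_groups):
    send = {r: np.array_split(np.array_split(data[r], model_degree, axis=1)[tp_idx],
                              ep_size, axis=0)
            for r in group}
    sharded.update(all_to_all(send, group))

for r in range(world_size):
    pair = range(r - r % model_degree, r - r % model_degree + model_degree)
    recovered = np.concatenate([sharded[p] for p in pair], axis=1)   # all-gather
    assert np.allclose(recovered, baseline[r])
print("drop -> all-to-all -> all-gather matches the full all-to-all")
```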
1. Non-expert uses TP, expert does not use TP: drop -> alltoall -> exec MoE -> alltoall -> allgather
2. Both non-expert and expert use TP (a sketch of this case follows below):
   - original execution order: alltoall -> exec MoE -> allreduce -> alltoall
   - optimized execution order: drop -> alltoall -> allgather -> exec MoE -> drop -> alltoall -> allgather
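Below is a hedged sketch of the optimized order for case 2, written with plain torch.distributed collectives. It assumes an already-initialized process group plus TP/EP subgroups created with `dist.new_group` (e.g. launched via torchrun with 8 ranks); slicing the drop along the hidden dimension, the group handles, and `expert_fn` are illustrative assumptions, not the actual DeepSpeed API.

```python
import torch
import torch.distributed as dist

def drop(x, tp_group):
    # Keep only this TP rank's slice of the hidden dimension; the slices
    # dropped here are identical copies held by the TP peers.
    tp_rank = dist.get_rank(group=tp_group)
    tp_size = dist.get_world_size(group=tp_group)
    return x.chunk(tp_size, dim=-1)[tp_rank].contiguous()

def allgather(x, tp_group):
    # Recover the full hidden dimension from the TP peers.
    parts = [torch.empty_like(x) for _ in range(dist.get_world_size(group=tp_group))]
    dist.all_gather(parts, x, group=tp_group)
    return torch.cat(parts, dim=-1)

def all_to_all(x, ep_group):
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=ep_group)
    return out

def optimized_moe_forward(dispatched, expert_fn, tp_group, ep_group):
    # drop -> alltoall -> allgather -> exec MoE -> drop -> alltoall -> allgather
    x = drop(dispatched, tp_group)    # TP peers hold identical dispatched tokens
    x = all_to_all(x, ep_group)       # half the original all-to-all volume
    x = allgather(x, tp_group)        # rebuild the full tokens for the TP expert
    x = expert_fn(x)                  # per the PR, TP peers again hold identical data here
    x = drop(x, tp_group)
    x = all_to_all(x, ep_group)
    return allgather(x, tp_group)
```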
Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>