DeepSpeed
474a3288 - Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)

Commit

1 year ago

Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)

Modified _replace_module in auto_tp.py: the change keeps the 'shared_expert_gate' and 'gate' layers in Qwen2-MoE as their original type, torch.nn.Linear, instead of converting them to LinearLayer. As a result, their weights are not split across multiple HPU/GPU cards, and Qwen2-MoE can run on multiple HPU/GPU cards. Because the weights of 'gate' are not split across cards, no all-gather operations are needed for it, which may improve performance.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
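
The idea described in the commit message can be illustrated with a minimal, framework-free sketch. This is not the DeepSpeed implementation: `Linear`, `ShardedLinear`, `Block`, `KEEP_REPLICATED`, and `replace_module` are hypothetical stand-ins (for torch.nn.Linear, a TP-sharded layer such as LinearLayer, and a container module), the layer sizes are made up, and matching on the exact child name is an assumption. It only shows the shape of the logic: walk the module tree, shard linear layers, and leave 'gate' and 'shared_expert_gate' untouched.

```python
from dataclasses import dataclass, field

# Layer names that stay replicated on every card (taken from the commit message).
KEEP_REPLICATED = {"gate", "shared_expert_gate"}


@dataclass
class Linear:
    """Hypothetical stand-in for torch.nn.Linear; records the full weight shape."""
    out_features: int
    in_features: int


@dataclass
class ShardedLinear:
    """Hypothetical stand-in for a TP-sharded layer; records this rank's slice."""
    out_features: int
    in_features: int


@dataclass
class Block:
    """Hypothetical stand-in for a container module with named children."""
    children: dict = field(default_factory=dict)


def replace_module(module: Block, tp_size: int) -> Block:
    """Recursively shard Linear children, skipping names in KEEP_REPLICATED."""
    for name, child in module.children.items():
        if isinstance(child, Block):
            replace_module(child, tp_size)
        elif isinstance(child, Linear) and name not in KEEP_REPLICATED:
            # Split output rows evenly across tp_size ranks (assumed divisible).
            module.children[name] = ShardedLinear(
                child.out_features // tp_size, child.in_features
            )
    return module


# Illustrative sizes only; they are not the real Qwen2-MoE dimensions.
moe = Block({
    "gate": Linear(4, 8),                 # router over 4 experts, kept whole
    "shared_expert_gate": Linear(1, 8),   # kept whole
    "shared_expert": Block({"gate_proj": Linear(16, 8)}),  # sharded
})
replace_module(moe, tp_size=2)
```

In this sketch the exact-name check means a projection called `gate_proj` is still sharded, while the router `gate` keeps its full weight on every rank, so its output needs no gathering before expert selection.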

References

loadams/a6000-fix-0-15-2

#6551 - Enabled Qwen2-MoE Tensor Parallelism (TP) inference

Author

gyou2021

Parents

1062a0c6
