Add new model: simple_gpt_tp_manual (#1969)
Summary:
Similar to simple_gpt, but instead of using the DTensor API to apply Tensor Parallelism (TP), this model shards the weights manually and calls functional collectives directly (see the sketch after the list below). Two main reasons it is beneficial to add this:
1. DTensor + compile is not ready yet
2. DTensor has CPU overhead, and adding this lower-overhead model will help us track improvements and regressions
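For illustration, a minimal sketch (not the benchmark's actual code) of what manual TP with pre-sharded weights and functional collectives can look like; the class names and the `tp_group` argument are hypothetical, and only torch.distributed APIs known to exist are used:

    # Assumes torch.distributed is already initialized and `tp_group` is the TP process group.
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    import torch.distributed._functional_collectives as funcol

    class ColumnParallelLinear(nn.Module):
        # Linear layer whose weight is sharded along the output dimension.
        def __init__(self, in_features, out_features, tp_group):
            super().__init__()
            world_size = dist.get_world_size(tp_group)
            assert out_features % world_size == 0
            # Each rank only materializes its shard of the weight.
            self.weight = nn.Parameter(torch.empty(out_features // world_size, in_features))

        def forward(self, x):
            # No communication needed here; outputs are column-sharded across ranks.
            return torch.nn.functional.linear(x, self.weight)

    class RowParallelLinear(nn.Module):
        # Linear layer whose weight is sharded along the input dimension.
        def __init__(self, in_features, out_features, tp_group):
            super().__init__()
            world_size = dist.get_world_size(tp_group)
            assert in_features % world_size == 0
            self.tp_group = tp_group
            self.weight = nn.Parameter(torch.empty(out_features, in_features // world_size))

        def forward(self, x):
            # Partial results are summed across the TP group with a functional all_reduce,
            # which returns a new tensor (no in-place mutation) and composes with torch.compile.
            partial = torch.nn.functional.linear(x, self.weight)
            return funcol.all_reduce(partial, reduceOp="sum", group=self.tp_group)

Compared to the DTensor path, there is no per-op dispatch through the DTensor layer: each rank holds plain tensors for its shard and the only distributed work is the explicit all_reduce, which is why this variant has lower CPU overhead.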
Tests:
In benchmark/:
python test.py -k "test_simple_gpt_manual_tp_"
In pytorch/:
PYTHONPATH=benchmark/ python pytorch/benchmarks/dynamo/torchbench.py --float16 -dcuda --inference --backend=inductor --multiprocess --performance --only simple_gpt_tp_manual
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1969
Reviewed By: xuzhao9
Differential Revision: D50130401
Pulled By: xmfan
fbshipit-source-id: cd4b5e543919024ff6c42c6fccfc0b12367d9bb2