Compare DDP static graph (C++ core) with legacy DDP forward and backward delay. (#61507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61507
Benchmark Python-only DDP vs production C++ based DistributedDataParallel.
- Implemented a pure-Python DDP, PythonDDP, with support for SYNC and ASYNC reduction
- Added compare_ddp to measure the difference in forward and backward step latency
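For readers who want a feel for how a per-step latency comparison can be set up, here is a minimal sketch. It is not the code in compare_ddp.py; `time_step` and its `warmup`/`iters` parameters are hypothetical names chosen for illustration, and the timed callable stands in for a forward or backward step.

```python
import time
from statistics import mean


def time_step(step_fn, warmup=2, iters=10):
    """Return the mean wall-clock latency of step_fn in seconds.

    step_fn is any zero-argument callable, e.g. a closure that runs one
    forward or one backward pass. Hypothetical helper, not part of PyTorch.
    """
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        step_fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        # Note: for CUDA work, the callable should itself synchronize
        # (e.g. via torch.cuda.synchronize()) before returning, otherwise
        # only the kernel launch time is measured.
        samples.append(time.perf_counter() - start)
    return mean(samples)
```

Timing the forward and backward closures separately for each DDP variant gives the two pairs of numbers that a comparison like this needs.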
Kudos to Shen and Yi for the great idea.
Test Plan:
Tested on a DevGPU with 2 CUDA devices.
$ python compare_ddp.py
Python-only DDP has slightly better forward performance (-1% latency) and slower backward performance (+2% to +20% latency).
This suggests that we need to keep the C++ core, since the backward latency increase can reach 20%. See README.md for details.
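As a reading aid for the percentages above, the sketch below shows one common way to express such a difference, assuming the figures are relative latency changes against the C++ baseline. That assumption, the helper name `latency_delta_pct`, and the sample values are illustrative only and are not taken from the benchmark.

```python
def latency_delta_pct(baseline, candidate):
    """Relative latency change of candidate vs. baseline, in percent.

    Positive means the candidate is slower, negative means it is faster.
    Hypothetical helper; both arguments must use the same time unit.
    """
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return (candidate - baseline) * 100.0 / baseline


# Placeholder latencies in milliseconds, not measured results.
cpp_backward_ms = 4.0
python_backward_ms = 5.0
print(latency_delta_pct(cpp_backward_ms, python_backward_ms))  # 25.0
```

Under this convention a negative value, such as the forward-pass figure, means the Python-only variant was faster than the baseline.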
Imported from OSS
Differential Revision:
D29685364
Reviewed By: mrshenli
Pulled By: bowangbj
fbshipit-source-id: 429e4473fac0ec4c70d6db12d946d2636dd6477a