Add distributed correctness checks (#1294)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1294
DDP+Dynamo experiments run in separate processes, and DDP wrapping also happens after the general correctness check, so they need to be handled separately.
Process:
1) When setting up the measurements in `ddp_experiments/__init__.py`, categorize some measurements as "reference" (e.g. the eager measurements) and others as "test". Also assign a file path where the reference correctness results will be stored.
2) For reference measurements, run an initial correctness measurement and dump the results to the file. For test measurements, load the reference results from the file and compare the test measurements against them.
3) Make sure to run the correctness measurement on all ranks, even if we're only doing the correctness check on rank 0.
4) Make sure to disable non-distributed correctness checks to avoid an additional iteration that might affect dynamo and eager parameters differently.
Currently hf_T5, hf_T5_large, hf_Bert, hf_GPT2_large, and timm_vision_transformer are passing; ~resnet50 is failing. Still investigating the resnet50 issue.~ resnet50 is also failing correctness with `python run.py resnet50 -t train -d cuda --torchdynamo inductor`, so this isn't a DDP-specific problem.
Usage:
```
python userbenchmark/ddp_experiments/__init__.py --job_dir /fsx/path/to/dir/shared/across/cluster --check_correctness_distributed
```
i.e. adding `--check_correctness_distributed` enables the correctness checks. Note that these checks shouldn't be used if you care about performance: we modify the models with `model.eval()` to get rid of dropouts, so the run is not representative of actual performance. Since performance presumably doesn't matter when this option is used, we also reduce the number of iterations to speed up the test.
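A toy illustration (not the benchmark code) of why `model.eval()` matters here: with dropout active, identical inputs produce different outputs across runs, so eager and dynamo outputs could not be compared; with dropout disabled, the forward pass is deterministic.

```python
import random

def dropout_forward(x, training, p=0.5, rng=None):
    """Minimal stand-in for a layer with dropout; purely illustrative."""
    if not training:
        return list(x)  # eval mode: dropout is the identity
    rng = rng or random.Random()
    # train mode: zero each element with probability p, rescale survivors
    return [0.0 if rng.random() < p else v / (1 - p) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
# Eval-mode outputs are reproducible across runs, so they are safe to
# compare between the reference and test measurements.
assert dropout_forward(x, training=False) == dropout_forward(x, training=False)
```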
Test Plan: Imported from OSS
Reviewed By: wconstab
Differential Revision: D41312110
Pulled By: davidberard98
fbshipit-source-id: c393923e2eac89209418abde5acb897fd382b6ba